
THE INTERNATIONAL SERIES OF
MONOGRAPHS ON PHYSICS

GENERAL EDITORS

The late Sir RALPH FOWLER
P. KAPITZA

N. F. MOTT, Melville Wills Professor of Theoretical Physics in the University of Bristol.
E. C. BULLARD, Professor of Physics, University of Toronto.

Already Published

THE THEORY OF ELECTRIC AND MAGNETIC SUSCEPTIBILITIES. By
J. H. VAN VLECK. 1932. Royal 8vo, pp. 396.

THE THEORY OF ATOMIC COLLISIONS. By N. F. MOTT and H. S. W. MASSEY.
1933. Royal 8vo, pp. 300.

RELATIVITY, THERMODYNAMICS, AND COSMOLOGY. By R. C. TOLMAN.
1934. Royal 8vo, pp. 518.

ELECTROLYTES. By HANS FALKENHAGEN. Translated by R. P. BELL. 1934.
Royal 8vo, pp. 364.

CHEMICAL KINETICS AND CHAIN REACTIONS. By N. SEMENOFF. 1935.
Royal 8vo, pp. 492.

RELATIVITY, GRAVITATION, AND WORLD-STRUCTURE. By E. A. MILNE.
1935. Royal 8vo, pp. 378.

THE QUANTUM THEORY OF RADIATION. By W. HEITLER. Second Edition.
1944. Royal 8vo, pp. 204.

THEORETICAL ASTROPHYSICS: ATOMIC THEORY AND THE ANALYSIS
OF STELLAR ATMOSPHERES AND ENVELOPES. By S. ROSSELAND. 1936.
Royal 8vo, pp. 376.

THE THEORY OF THE PROPERTIES OF METALS AND ALLOYS. By N. F.
MOTT and H. JONES. 1936. Royal 8vo, pp. 340.

ECLIPSES OF THE SUN AND MOON. By SIR FRANK DYSON and R. v. d. R.
WOOLLEY. 1937. Royal 8vo, pp. 168.

THE PRINCIPLES OF STATISTICAL MECHANICS. By R. C. TOLMAN. 1938.
Royal 8vo, pp. 682.

THE ULTRACENTRIFUGE. By THE SVEDBERG and KAI O. PEDERSEN. 1940.
Royal 8vo, pp. 488.

ELECTRONIC PROCESSES IN IONIC CRYSTALS. By N. F. MOTT and R. W.
GURNEY. 1940. Royal 8vo, pp. 275.

GEOMAGNETISM. By S. CHAPMAN and J. BARTELS. 1940. Royal 8vo, 2 vols.,
pp. 1076.

THE SEPARATION OF GASES. By M. RUHEMANN. Second Impression. 1945.
Royal 8vo, pp. 298.

KINETIC THEORY OF LIQUIDS. By J. FRENKEL. 1946. Royal 8vo, pp. 500.

THE PRINCIPLES OF QUANTUM MECHANICS. By P. A. M. DIRAC. Third
Edition. 1947. Royal 8vo, pp. 324.

COSMIC RAYS. By L. JÁNOSSY. 1948. Royal 8vo, pp. 440.



THEORY OF 

PROBABILITY 

BY 

HAROLD JEFFREYS 

M.A., D.Sc., F.R.S. 

PLUMIAN PROFESSOR OF ASTRONOMY
UNIVERSITY OF CAMBRIDGE 


SECOND EDITION 


OXFORD 

AT THE CLARENDON PRESS 

1948 



Oxford University Press, Amen House, London E.C. 4

GLASGOW NEW YORK TORONTO MELBOURNE WELLINGTON
BOMBAY CALCUTTA MADRAS CAPE TOWN

Geoffrey Cumberlege, Publisher to the University


PRINTED IN GREAT BRITAIN



PREFACE TO THE SECOND EDITION 


In the circumstances that have prevailed in the world since the appearance
of this book, it is a welcome indication of increasing interest in the
principles of scientific method that a second edition has been required.
I have taken the opportunity to add some arguments that go far towards
establishing the consistency of the product rule and therefore of the 
principle of inverse probability. A theory of invariance has been 
developed and applied to problems of estimation and significance, thus 
establishing the possibility of a consistent rule for stating prior proba- 
bilities over large parts of the subject. I am not satisfied that it is the 
only such rule or even the best one, but think that enough progress has 
been made to indicate that the attempt is worth pursuing. 

I have not attempted to answer explicitly the criticisms made by 
reviewers, because on examination I found that they were all dealt with 
in the book already. What does strike me as remarkable is that no 
mention was made of the fact that the book contained useful methods 
of treatment of several problems of practical importance. I have still 
not gathered what distinction those statisticians who do not accept 
the epistemological approach draw between estimation problems and 
significance tests, or whether they think that they are saying anything 
about a hypothesis when they reject it. So far as I can judge from 
their pronouncements, they provide themselves with no reason against 
continuing to make predictions from it. 

Several recent writers, especially in the United States, have described 
me as a follower of the late Lord Keynes. Without wishing to disparage 
Keynes, I must point out that the first two papers by Wrinch and me in 
the Philosophical Magazine of 1919 and 1921 preceded the publication of 
Keynes’s book. What resemblance there is between the present theory 
and that of Keynes is due to the fact that Broad, Keynes, and my col- 
laborator had all attended the lectures of W. E. Johnson. Keynes’s 
distinctive contribution was the assumption that probabilities are only 
partially ordered; this contradicts my Axiom 1. I gave reasons for not
accepting it in Scientific Inference. Keynes himself withdrew it in his
biographical essay on F. P. Ramsey.

I have to thank several correspondents for suggesting corrections,
especially Dr. H. Chojnacki-Hanani. Mr. P. H. Diananda of Caius
College and Mr. V. S. Huzurbazar of Fitzwilliam House, Cambridge,
have helped greatly in the proof-correction.

H. J.

ST. JOHN'S COLLEGE, CAMBRIDGE
October 1947



PREFACE TO THE FIRST EDITION 


The chief object of this work is to provide a method of drawing infer- 
ences from observational data that will be self-consistent and can also 
be used in practice. Scientific method has grown up without much 
attention to logical foundations, and at present there is little relation 
between three main groups of workers. Philosophers, mainly interested 
in logical principles but not much concerned with specific applications, 
have mostly followed in the tradition of Bayes and Laplace; but with
the brilliant exception of Professor C. D. Broad have not paid much
attention to the consequences of adhering to the tradition in detail.
Modern statisticians have developed extensive mathematical techniques,
but for the most part have rejected the notion of the probability of a 
hypothesis, and thereby deprived themselves of any way of saying 
precisely what they mean when they decide between hypotheses. 
Physicists have been described, by an experimental physicist who has 
devoted much attention to the matter, as not only indifferent to fundamental
analysis but actively hostile to it; and with few exceptions their
statistical technique has hardly advanced beyond that of Laplace. In 
opposition to the statistical school, they and some other scientists are 
liable to say that a hypothesis is definitely proved by observation, 
which is certainly a logical fallacy; most statisticians appear to regard
observations as a basis for possibly rejecting hypotheses, but in no case 
for supporting them. The latter attitude, if adopted consistently, 
would reduce all inductive inference to guesswork; the former, if 
adopted consistently, would make it impossible ever to alter the hypo- 
theses, however badly they agreed with new evidence. The present 
attitudes of most physicists and statisticians are diametrically opposed, 
but lack of a common meeting-ground has, to a very large extent, pre- 
vented the opposition from being noticed. Nevertheless, both schools 
have made great scientific advances, in spite of the fact that their 
fundamental notions, for one reason or the other, would make such 
advances impossible if they were consistently maintained. 

In the present book I reject the attempt to reduce induction to 
deduction, which is characteristic of both schools, and maintain that 
the ordinary common-sense notion of probability is capable of precise
and consistent treatment when once an adequate language is provided 
for it. It leads to the result that a precisely stated hypothesis may 
attain either a high or a negligible probability as a result of observa- 
tional data, and therefore to an attitude intermediate between those 
current in physics and statistics, but in accordance with ordinary
thought. Fundamentally the attitude is that of Bayes and Laplace,
though it is found necessary to modify their hypotheses before some
types of cases not considered by them can be treated, and some steps
in the argument have been filled in. For instance, the rule for assessing 
probabilities given in the first few lines of Laplace’s book is Theorem 7, 
and the principle of inverse probability is Theorem 10. There is, on the 
whole, a very good agreement with the recommendations made in 
statistical practice; my objection to current statistical theory is not so
much to the way it is used as to the fact that it limits its scope at the 
outset in such a way that it cannot state the questions asked, or the 
answers to them, within the language that it provides for itself, and 
must either appeal to a feature of ordinary language that it has declared 
to be meaningless, or else produce arguments within its own language 
that will not bear inspection. 

The most beneficial result that I can hope for as a consequence of 
this work is that more attention will be paid to the precise statement 
of the alternatives involved in the questions asked. It is sometimes 
considered a paradox that the answer depends not only on the observations
but on the question; it should be a platitude.

The theory is applied to most of the main problems of statistics, and 
a number of specific applications are given. It is a necessary condition 
for their inclusion that they shall have interested me. As my object is 
to produce a general method I have taken examples from a number of 
subjects, though naturally there are more from physics than from
biology and more from geophysics than from atomic physics. It was, 
as a matter of fact, mostly with a view to geophysical applications that 
the theory was developed. It is not easy, however, to produce a 
statistical method that has application to only one subject; though 
intraclass correlation, for instance, which is a matter of valuable posi- 
tive discovery in biology, is usually an unmitigated nuisance in physics. 
It may be felt that many of the applications suggest further questions. 
That is inevitable. It is usually only when one group of questions has 
been answered that a further group can be stated in an answerable form 
at all. 

I must offer my warmest thanks to Professor R. A. Fisher and Dr. J. 
Wishart for their kindness in answering numerous questions from a not
very docile pupil, and to Mr. R. B. Braithwaite, who looked over the 
manuscript and suggested a number of improvements; also to the 
Clarendon Press for their extreme courtesy at all stages. 

H. J. 

ST. JOHN'S COLLEGE, CAMBRIDGE



CONTENTS 


I. FUNDAMENTAL NOTIONS

II. DIRECT PROBABILITIES

III. ESTIMATION PROBLEMS

IV. APPROXIMATE METHODS AND SIMPLIFICATIONS

V. SIGNIFICANCE TESTS: ONE NEW PARAMETER

VI. SIGNIFICANCE TESTS: VARIOUS COMPLICATIONS

VII. FREQUENCY DEFINITIONS AND DIRECT METHODS

VIII. GENERAL QUESTIONS

APPENDIX. TABLES OF K

NOTE ON THE CONSISTENCY OF THE PRODUCT RULE

NOTE ON THE INFINITE REGRESS ARGUMENT

INDEX



I 


FUNDAMENTAL NOTIONS 

They say that Understanding ought to work by the rules of right reason. 
These rules are, or ought to be, contained in Logic ; but the actual science of 
logic is conversant at present only with things either certain, impossible, or 
entirely doubtful, none of which (fortunately) we have to reason on. Therefore 
the true logic for this world is the calculus of Probabilities, which takes account 
of the magnitude of the probability which is, or ought to be, in a reasonable 
man’s mind. 

J. Clerk Maxwell

1.0. The fundamental problem of scientific progress, and a fundamental 
one of everyday life, is that of learning from experience. Knowledge 
obtained in this way is partly merely description of what we have already 
observed, but part consists of making inferences from past experience 
to predict future experience. This part may be called generalization or 
induction. It is the most important part; events that are merely 
described and have no apparent relation to others may as well be for- 
gotten, and in fact usually are. The theory of learning in general is 
the branch of logic known as epistemology. A few illustrations will 
indicate the scope of induction. A botanist is confident that the plant 
that grows from a mustard seed will have yellow flowers with four long 
and two short stamens, and four petals and sepals, and this is inferred 
from previous instances. The Nautical Almanac's predictions of the
positions of the planets, an engineer’s estimate of the output of a new 
dynamo, and an agricultural statistician’s advice to a farmer about the 
utility of a fertilizer are all inferences from past experience. When a 
musical composer scores a bar he is expecting a definite series of sounds
when an orchestra carries out his instructions. In every case the 
inference rests on past experience that certain relations have been 
found to hold; and those relations are then applied to new cases that 
were not part of the original data. The same applies to my expectations 
about the flavour of my next meal. The process is so habitual that 
we hardly notice it, and we can hardly exist for a minute without carry- 
ing it out. On the rare occasions when anybody mentions it, it is called 
common sense and left at that. 

Now such inference is not covered by logic, as the word is ordinarily 
understood. Traditional or deductive logic admits only three attitudes 
to any proposition: definite proof, disproof, or blank ignorance. But 
no number of previous instances of a rule will provide a deductive proof
that the rule will hold in a new instance. There is always the formal
possibility of an exception. 

Deductive logic and its close associate, pure mathematics, have been 
developed to an enormous extent, and in a thoroughly systematic way 
— indeed several ways. Scientific method, on the other hand, has grown 
up more or less haphazard, techniques being developed to deal with 
problems as they arose, without much attempt to unify them, except 
so far as most of the theoretical side involved the use of pure mathe- 
matics, the teaching of which required attention to the nature of some 
sort of proof. Unfortunately the mathematical proof is deductive, and 
induction in the scientific sense is simply unintelligible to the pure 
mathematician — as such; in his unofficial capacity he may be able to 
do it very well. Consequently little attention has been paid to the 
nature of induction, and apart from actual mathematical technique the 
relation between science and mathematics has done little to develop a 
connected account of the characteristic scientific mode of reasoning. 
Many works exist claiming to give such an account, and there are some 
highly useful ones dealing with methods of treating observations that
have been found useful in the past and may be found useful again. 
But when they try to deal with the underlying general theory they 
suffer from all the faults that modern pure mathematics has been try- 
ing to get rid of: self-contradictions, circular arguments, postulates 
used without being stated, and postulates stated without being used. 
Running through the whole is the tendency to claim that scientific 
method can be reduced in some way to deductive logic, which is the 
most fundamental fallacy of all: it can be done only by rejecting its 
chief feature, induction. 

The principal field of application of deductive logic is pure mathe- 
matics, which pure mathematicians recognize quite frankly as dealing 
with the working out of the consequences of stated rules with no 
reference to whether there is anything in the world that satisfies those 
rules. Its propositions are of the form ‘If p is true, then q is true’, 
irrespective of whether we can find any actual instance where p is true. 
The mathematical proposition is the whole proposition, ‘If p is true, 
then q is true’, which may be true even if p is in fact always false. In 
applied mathematics, as usually taught, general rules are asserted as 
applicable to the external world, and the consequences are developed 
logically by the technique of pure mathematics. If we inquire what 
reason there is to suppose the general rules true, the usual answer is 
simply that they are known from experience. However, this use of the
word ‘experience’ covers a confusion. The rules are inferred from past
experience, and then applied to future experience, which is not the same 
thing. There is no guarantee whatever in deductive logic that a rule 
that has held in all previous instances will not break down in the next 
instance or in all future instances. Indeed there are an infinite number 
of rules that have held in all previous cases and cannot possibly all
hold in future ones. For instance, consider a body falling freely under
gravity. It would be asserted that the distance at time $t$ below a fixed
level is given by a formula of the type

$$s = a + ut + \tfrac{1}{2}gt^2. \qquad (1)$$

This might be asserted from observations of $s$ at a series of instants
$t_1, t_2, \ldots, t_n$. That is, our previous experience asserts the proposition
that $a$, $u$, and $g$ exist such that

$$s_r = a + ut_r + \tfrac{1}{2}gt_r^2 \qquad (2)$$

for all values of $r$ from 1 to $n$. But the law (1) is asserted for all values
of $t$. But consider the law

$$s = a + ut + \tfrac{1}{2}gt^2 + f(t)(t-t_1)(t-t_2)\cdots(t-t_n), \qquad (3)$$

where $f(t)$ may be any function whatever that is not infinite at any of
$t_1, t_2, \ldots, t_n$, and $a$, $u$, and $g$ have the same values as in (1). There are an
infinite number of such functions. Every form of (3) will satisfy the
set of relations (2), and therefore every one has held in all previous
cases. But if we consider any other instant $t_{n+1}$ (which might be either
within or outside the range of time between the first and last of the
original observations) it will be possible to choose $f(t_{n+1})$ in such a way
as to give $s$ as found from (3) any value whatever at time $t_{n+1}$. Further,
there will be an infinite number of forms of $f(t)$ that would give the same
value of $f(t_{n+1})$, and there are an infinite number that would give
different values. If we observe $s$ at time $t_{n+1}$, we can choose $f(t_{n+1})$ to
give agreement with it, but an infinite number of forms of $f(t)$ consistent
with this value would be consistent with any arbitrary value of $s$ at a
further moment $t_{n+2}$. That is, even if all the observed values agree with
(1) exactly, deductive logic can say nothing whatever about the value
of $s$ at any other time. An infinite number of laws agree with previous
experience, and an infinite number that have agreed with previous
experience will inevitably be wrong in the next instance. What the applied
mathematician does, in fact, is to select one form out of this infinity;
and his reason for doing so has nothing whatever to do with traditional
logic. He chooses the simplest. This is actually an understatement of
the case; because in general the observations will not agree with (1)
exactly, a polynomial of $n$ terms can still be found that will agree exactly
with the observed values at times $t_1, t_2, \ldots, t_n$, and yet the form (1) may
still be asserted. Similar considerations apply to any quantitative law.
The further discussion of this matter must be reserved till we come to 
significance tests. We need notice at the moment only that the choice 
of the simplest law that fits the facts is an essential part of procedure 
in applied mathematics, and cannot be justified by the methods of 
deductive logic. It is, however, rarely stated, and when it is stated it 
is usually in a manner suggesting that it is something to be ashamed
of. We may recall the words of Brutus. 

But ’tis a common proof
That lowliness is young ambition’s ladder,
Whereto the climber upwards turns his face;
But when he once attains the upmost round,
He then unto the ladder turns his back,
Looks in the clouds, scorning the base degrees
By which he did ascend.
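
The falling-body argument above can be checked numerically. The following is only a minimal sketch: the values of a, u, and g, the observation instants, and the particular choices of f(t) are hypothetical and chosen purely for illustration. It shows that several forms of law (3) reproduce every observation that agrees with law (1), yet disagree at an instant that was not observed, and that a suitable f can be found to give any chosen value there.

```python
import math

# Hypothetical numbers chosen only for illustration; none come from the text.
a, u, g = 0.0, 0.0, 9.8
times = [0.5, 1.0, 1.5, 2.0]            # the observation instants t_1 .. t_n


def law1(t):
    """Law (1): s = a + u*t + (1/2)*g*t^2."""
    return a + u * t + 0.5 * g * t * t


def vanishing_product(t):
    """(t - t_1)(t - t_2)...(t - t_n); zero at every observation instant."""
    product = 1.0
    for t_r in times:
        product *= t - t_r
    return product


def law3(t, f):
    """Law (3): law (1) plus an arbitrary f(t) times the vanishing product."""
    return law1(t) + f(t) * vanishing_product(t)


def constant_f_for(target, t):
    """A constant f that makes law (3) give `target` at an unobserved time t."""
    return (target - law1(t)) / vanishing_product(t)


# Observations that agree with law (1) exactly.
observed = [law1(t) for t in times]

# A few rival forms of f(t); each is an arbitrary illustrative choice.
rivals = {
    "f = 0 (law 1)": lambda t: 0.0,
    "f = 1": lambda t: 1.0,
    "f = sin t": math.sin,
    "f = -5 t^2": lambda t: -5.0 * t * t,
}

t_new = 2.5                              # an instant that was not observed
predictions = {}
for name, f in rivals.items():
    fits = all(math.isclose(law3(t, f), s) for t, s in zip(times, observed))
    predictions[name] = law3(t_new, f)
    print(f"{name:14s} fits every observation: {fits}; "
          f"s({t_new}) = {predictions[name]:.3f}")

# Any value whatever can be arranged at t_new by a suitable choice of f.
c = constant_f_for(1000.0, t_new)
print("law (3) with constant f =", c, "gives", law3(t_new, lambda t: c))
```

Every rival passes the test of past experience, so nothing in the data alone, treated deductively, separates them; the preference for the form with f = 0 has to come from elsewhere.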

It is asserted, for instance, that the choice of the simplest law is purely 
a matter of economy of description or thought, and has nothing to do 
with any reason for believing the law. No reason in deductive logic, 
certainly; but the question is, Does deductive logic contain the whole
of reason? It does give economy of description of past experience, but
is it unreasonable to be interested in future experience? Do we make
predictions merely because those predictions are the easiest to make?
Does the Nautical Almanac Office laboriously work out the positions 
of the planets by means of a complicated set of tables based on the 
law of gravitation and previous observations, merely for convenience, 
when it might much more easily guess them? Do sailors trust the 
safety of their ships to the accuracy of these predictions for the same 
reason? Does a town install a new tramway system, with expensive 
plant and much preliminary consultation with engineers, with no more 
reason to suppose that the trams will move than that the laws of 
electromagnetic induction are a saving of trouble? I do not believe 
for a moment that anybody will answer any of these questions in the 
affirmative; but an affirmative answer is implied by the assertion that 
is still frequently made, that the choice of the simplest law is merely a 
matter of convention. I say, on the contrary, that the simplest law is 
chosen because it is the most likely to give correct predictions; that the 
choice is based on a reasonable degree of belief; and that the fact that 
deductive logic provides no explanation of the choice of the simplest 
law is an absolute proof that deductive logic is grossly inadequate to 



§ 1.0 


FUNDAMENTAL NOTIONS 


6 


cover scientific and practical requirements. It is sometimes said, again, 
that the trust in the simple law is a peculiarity of human psychology; 
a different type of being might behave differently. Well, I see no point
whatever in discussing at length whether the human mind is any use; 
it is not a perfect reasoning instrument, but it is the only one we have. 
Deductive logic itself could never be known without the human mind. 
If anybody rejects the human mind and then holds that he is construct- 
ing valid arguments, he is contradicting himself; if he holds that human 
minds other than his own are useless, and then hopes to convince them 
by argument, he is again contradicting himself. A critic is himself 
using inductive inference when he expects his words to convey the same 
meaning to his audience as they do to himself, since the meanings of 
words are learned first by noting the correspondence between things 
and the sounds uttered by other people, and then applied in new 
instances. On the face of it, it would appear that a general state- 
ment that something accepted by the bulk of mankind is intrinsically 
nonsense requires much more to support it than a mere declaration. 

Many attempts have been made, while accepting induction, to claim
that it can be reduced in some way to deduction. Bertrand Russell
has remarked that induction is either disguised deduction or a mere
method of making plausible guesses.† In the former sense we must look
for some general principle, which states a set of possible alternatives;
then observations are used to show that all but one of these are wrong,
and the survivor is held to be deductively demonstrated. Such an
attitude has been widely advocated. On it I quote Professor C. D.
Broad.‡

‘The usual view of the logic books seems to be that inductive arguments are
really syllogisms with propositions summing up the relevant observations as
minors, and a common major consisting of some universal proposition about
nature. If this were true it ought to be easy enough to find the missing major,
and the singular obscurity in which it is enshrouded would be quite inexplicable.
It is reverently referred to by inductive logicians as the Uniformity of Nature;
but, as it is either never stated at all or stated in such terms that it could not
possibly do what is required of it, it appears to be the inductive equivalent of
Mrs. Gamp’s mysterious friend, and might be more appropriately named Major
Harris.

‘It is in fact easy to prove that this whole way of looking at inductive arguments
is mistaken. On this view they are all syllogisms with a common major.
Now their minors are propositions summing up the relevant observations. If the
observations have been carefully made the minors are practically certain. Hence,
if this theory were true, the conclusions of all inductive arguments in which the
observations were equally carefully made would be equally probable. For what
could vary the probabilities? Not the major, which is common to all of them.
Not the minors, which by hypothesis are equally certain. Not the mode of
reasoning, which is syllogistic in each case. But the result is preposterous, and
is enough to refute the theory which leads to it.’

† Principles of Mathematics, p. 360. He said, at the Aristotelian Society summer
meeting in 1938, that this remark has been too much quoted. I therefore offer apologies
for quoting it again. He has also remarked that the inductive philosophers of Central
Africa formerly held the view that all men were black. My comment would be that
the deductive ones, if there were any, did not hold that there were any men, black,
white, or yellow.

‡ Mind, 29, 1920, 11.

Attempts have been made recently to supply the missing major by
several modern physicists, notably Sir Arthur Eddington and Professor
E. A. Milne. But their general principles and their results differ even
within the very limited field of knowledge where they have been
applied. How is a person with less penetration to know which is right,
if any? Only by comparing the results with observation; and then his
reason for believing the survivor to be likely to give the right results
in future is inductive. I am not denying that one of them may have
got the right results. But I reject the statement that any of them can
be said to be certainly right as a matter of pure logic, independently of
experience; and I gravely doubt whether any of them could have been
thought of at all had the authors been unaware of the vast amount of
previous work that had led to the establishment by inductive methods
of the laws that they set out to explain. These attempts, though they 
appear to avoid Broad’s objection, do so only within a limited range, 
and it is doubtful whether such an attempt is worth making if it can 
at best achieve a partial success, when induction can cover the whole 
field without supposing that special rules hold in certain subjects. 

I should maintain (with N. R. Campbell, who says† that a physicist
would be more likely to interchange the two terms in Russell’s state- 
ment) that a great deal of what passes for deduction is really disguised 
induction, and that even some of the postulates of Principia Mathe- 
matica are adopted on inductive grounds (which, incidentally, are false). 

Two attempts at a justification of induction, still sometimes made, 
are as follows. (1) Induction has worked in the past; therefore it will 
work in the future. It is obvious that this is itself an inductive inference 
and involves the same problems in a more complicated way. (2) The 
struggle for existence would favour members with the ability to predict 
correctly the consequences of their actions. Consequently the fact that 
man has survived implies that he has this ability (and presumably 


t Phytics, The Elements, 1920, 9. 



FUNDAMENTAL NDTIONS 


7 


S 1.0 

Amoeba has too). But the belief that there is a struggle for existence 
and that it favours particular types is based on induction. Both argu- 
ments replace the original question by another as difficult or more so, 
and take no effective step towards a solution. 

Karl Pearson† writes as follows:

‘Now this is the peculiarity of scientific method, that when once it has become 
a habit of mind, that mind converts all facts whatsoever into science. The field 
of science is unlimited; its material is endless, every group of natural phenomena,
every phase of social life, every stage of past or present development is material 
for science. The unity of all science consists alone in its method, not in its material. 
The man who classifies facts of any kind whatever, who sees their mutual relation 
and describes their sequences, is applying the scientific method and is a man of 
science. The facts may belong to the past history of mankind, to the social 
statistics of our great cities, to the atmosphere of the most distant stars, to the 
digestive organs of a worm, or to the life of a scarcely visible bacillus. It is not 
the facts themselves which form science, but the methods by which they are 
dealt with.’ 

Here, in a few sentences, Pearson sets our problem. The italics are his. 
He makes a clear distinction between method and material. No matter 
what the subject-matter, the fundamental principles of the method 
must be the same. There must be a uniform standard of validity for 
all hypotheses, irrespective of the subject. Different laws may hold in 
different subjects, but they must be tested by the same criteria; other- 
wise we have no guarantee that our decisions will be those warranted 
by the data and not merely the result of inadequate analysis or of 
believing what we want to believe. An adequate theory of induction 
must satisfy two conditions. First, it must provide a general method; 
secondly, the principles of the method must not of themselves say any- 
thing about the world. If the rules are not general, we shall have 
different standards of validity in different subjects, or different 
standards for one’s own hypotheses and somebody else’s. If the rules 
of themselves say anything about the world, they will make empirical 
statements independently of observational evidence, and thereby limit 
the scope of what we can find out by observation. If there are such 
limits, they must be inferred from observation; we must not assert them 
in advance. 

We must notice at the outset that induction is more general than 
deduction. The answers given by the latter are limited to a simple 
‘yes’, ‘no’, or ‘it doesn’t follow’. Inductive logic must split up the last 
alternative, which is of no interest to deductive logic, into a number 
of others, and say which of them it is most reasonable to believe on
the evidence available. Complete proof and disproof are merely the
extreme cases. Any inductive inference involves in its very nature the
possibility that the alternative chosen as the most likely may in fact
be wrong. Exceptions are always possible, and if a theory does not
provide for them it will be claiming to be deductive when it cannot be.
On account of this extra generality, induction must involve postulates
not included in deduction. Our problem is to state these postulates.
It is important to notice that they cannot be proved by deductive
logic. If they could, induction would be reduced to deduction, which
is impossible. Equally they are not empirical generalizations; for induction
would be needed to make them and the argument would be
circular. We must in fact distinguish the general rules of the theory
from the empirical content. The general rules are a priori propositions,
accepted independently of experience, and making by themselves no
statement about experience. Induction is the application of the rules
to observational data.

† The Grammar of Science, 1892. P. 16 of Everyman edition, 1938.

Our object, in short, is not to prove induction; it is to tidy it up. 
Even among professional statisticians there are considerable differences 
about the best way of treating the same problem, and, I think, all 
statisticians would reject some methods habitual in some branches of 
physics. The question is whether we can construct a general method, 
the acceptance of which would avoid these differences or at least reduce 
them. 

1.1. The test of the general rules, then, is not any sort of proof. This 
is no objection because the primitive propositions of deductive logic 
cannot be proved either. All that can be done is to state a set of 
hypotheses, as plausible as possible, and see where they lead us. The 
fullest development of deductive logic and of the foundations of 
mathematics is that of Principia Mathematica, which starts with a number of 
primitive propositions taken as axioms; if the conclusions are accepted, 
that is because we are willing to accept the axioms, not because the 
latter are proved. The same applies, or used to apply, to Euclid. We 
must not hope to prove our primitive propositions when this is the 
position in pure mathematics itself. But we have rules to guide us in 
stating them, largely suggested by the procedure of logicians and pure 
mathematicians. 

1. All hypotheses used must be explicitly stated, and the conclusions 
must follow from the hypotheses. 

2. The theory must be self-consistent; that is, it must not be possible 
to derive contradictory conclusions from the postulates and any given 
set of observational data. 

3. Any rule given must be applicable in practice. A definition is 
useless unless the thing defined can be recognized in terms of the 
definition when it occurs. The existence of a thing or the estimate of 
a quantity must not involve an impossible experiment. 

4. The theory must provide explicitly for the possibility that infer- 
ences made by it may turn out to be wrong. A law may contain 
adjustable parameters, which may be wrongly estimated, or the law 
itself may be afterwards found to need modification. It is a fact that 
revision of scientific laws has often been found necessary in order to 
take account of new information — the relativity and quantum theories 
providing conspicuous instances — and there is no conclusive reason to 
suppose that any of our present laws are final. But we do accept 
inductive inference in some sense; we have a certain amount of con- 
fidence that it will be right in any particular case, though this confidence 
does not amount to logical certainty. 

5. The theory must not deny any empirical proposition a priori; any 
precisely stated empirical proposition must be formally capable of being 
accepted, in the sense of the last rule, given a moderate amount of 
relevant evidence. 

These five rules are essential. The first two impose on inductive logic 
criteria already required in pure mathematics. The third and fifth 
enforce the distinction between a priori and empirical propositions; if 
an existence depends on an inapplicable definition we must either find 
an applicable one, treat the existence as an empirical proposition 
requiring test, or abandon it. The fourth states the distinction between 
induction and deduction. The fifth makes Pearson’s distinction be- 
tween material and method explicit, and involves the definite rejection 
of attempts to derive empirically verifiable propositions from general 
principles adopted independently of experience. 

The following rules also serve as useful guides. 

6. The number of postulates should be reduced to a minimum. This 
is done for deductive logic in Principia, though many theorems proved 
there appear to be as obvious intuitively as the postulates. The motive 
for not accepting other obvious propositions as postulates is partly 
artistic. But we cannot regard the human mind as a perfect reasoner, 
and a reduction of the number of postulates affords a check on the 
consistency of different propositions, any of which we might be ready 
to accept by itself. This is still more needed in induction, since the 
beliefs often accepted as intuitively certain are more numerous, and, 
I believe, some of them are definitely inconsistent, while others are not 
primitive propositions but inductive inferences. If they are, they can- 
not, of course, be asserted as certain, but they may be asserted with 
so high a probability that there will be little difference in practice. 

7. While we do not regard the human mind as a perfect reasoner, 
we must accept it as a useful one and the only one available. The 
theory need not represent actual thought-processes in detail, but should 
agree with them in outline. We are not limited to considering only 
the thought-processes that people describe to us. It often happens that 
their behaviour is a better criterion of their inductive processes than 
their arguments. If a result is alleged to be obtained by arguments 
that are certainly wrong, it does not follow that the result is wrong, 
since it may have been obtained by a rough inductive process that the 
author thinks it undesirable or unnecessary to state on account of 
the traditional insistence on deduction as the only valid reasoning. 
I disagree utterly with many arguments produced by the chief current 
schools of statistics, but I rarely differ seriously from the conclusions; 
their practice is far better than their precept. I should say that this 
is the result of common sense emerging in spite of the deficiencies of 
mathematical teaching. The theory must provide criteria for testing 
the chief types of scientific law that have actually been suggested or 
asserted. Any such law must be taken seriously in the sense that it can 
be asserted with confidence on a moderate amount of evidence. The 
fact that simple laws are often asserted will, on this criterion, require 
us to say that in any particular instance some simple law is quite likely 
to be true. 

8. In view of the greater complexity of induction, we cannot hope 
to develop it more thoroughly than deduction. We shall therefore take 
it as a rule that an objection carries no weight if an analogous objection 
would invalidate part of generally accepted pure mathematics. I do 
not wish to insist on any particular justification of pure mathematics, 
since authorities on its foundations are far from being agreed among 
themselves. In Principia much of higher mathematics, including the 
whole theory of the continuous variable, rests on the axioms of infinity 
and reducibility, which are rejected by Hilbert. F. P. Ramsey rejects 
the axiom of reducibility, while declaring that the multiplicative axiom, 
properly stated, is the most evident tautology, though Whitehead and 
Russell express much doubt about it and carefully separate propositions 
that depend on it from those that can be proved without it. I should 
go further and say that the proof of the existence of numbers, according 
to the Principia definition of number, depends on the postulate that 
all individuals are permanent, which is an empirical proposition, and 
a false one, and should not be made part of a deductive logic. But we 
do not need such a proof for our purposes. It is enough that pure 
mathematics should be consistent. If the postulate could hold in some 
world, even if it was not the actual world, that would be enough to 
establish consistency. Then the derivation of ordinary mathematics 
from the postulates of Principia can be regarded as a proof of its con- 
sistency. But the justification of all the justifications seems to be that 
they lead to ordinary pure mathematics in the end; I shall assume that 
the latter has validity irrespective of any particular justification. 

The above principles will strike many readers as platitudes; and if 
they do I shall not object. But they require the rejection of several 
principles accepted as fundamental in other theories. They rule out, 
in the first place, any definition of probability that attempts to define 
probability in terms of infinite sets of possible observations, for we 
cannot in practice make an infinite number of observations. The Venn 
limit, the hypothetical infinite population of Fisher, and the ensemble 
of Willard Gibbs are useless to us by rule 3. Though many accepted 
results appear to be based on these definitions, a closer analysis shows 
that further hypotheses are required before any results are obtained, 
and these hypotheses are not stated. In fact, no ‘objective’ definition 
of probability in terms of actual or possible observations, or possible 
properties of the world, is admissible. For, if we made anything in our 
fundamental principles depend on observations or on the structure of 
the world, we should have to say either (1) that the observations we 
can make, and the structure of the world, are initially unknown; then 
we cannot know our fundamental principles, and we have no possible 
starting-point; or (2) that we know a priori something about observa- 
tions or the structure of the world, and this is illegitimate by rule 5. 
Attempts to use the latter principle will superpose our preconceived 
notions of what is objective on the entire system, whereas, if objectivity 
has any meaning at all, our aim must be to find out what is objective 
by means of observations. To try to give objective definitions at the 
start will at best produce a circular argument, may lead to contradic- 
tions, and in any case will make the whole scheme subjective beyond hope 
of recovery. We must not rule out any empirical proposition a priori; 
we must provide a system that will enable us to test it when occasion 
arises, and this requires a completely comprehensive formal scheme. 



We must also reject what is variously called the principle of causality, 
determinism, or the uniformity of nature, in any such form as ‘Precisely 
similar antecedents lead to precisely similar consequences’. No two 
sets of antecedents are ever identical; they must differ at least in time 
and position. But even if we decide to regard time and position as 
irrelevant (which may be true, but has no justification in pure logic) 
the antecedents are never identical. In fact, determinists usually recog- 
nize this verbally and try to save the principle by restating it in some 
such form as: ‘In precisely the same circumstances very similar things 
can be observed, or very similar things can usually be observed.’† If 
‘precisely the same’ is intended to be a matter of absolute truth, we 
cannot achieve it. Astronomy is usually considered a science, but the 
planets have never even approximately repeated their positions since 
astronomy began. The principle gives us no means of inferring the 
accelerations at a single instant, and is utterly useless. Further, if it 
was to be any use we should have to know at any application that the 
entire condition of the world was the same as in some previous instance. 
This is never satisfied in the most carefully controlled experimental 
conditions. The most that can be done is to make those conditions the 
same that we believe to be relevant — ‘the same’ can never in practice 
mean more than ‘the same as far as we know’, and usually means a 
great deal less. The question then arises, How do we know that the 
neglected variables are irrelevant? Only by actually allowing them to 
vary and verifying that there is no associated variation in the result; 
but this requires the use of significance tests, a theory of which must 
therefore be given before there is any application of the principle, and 
when it is given it is found that the principle is no longer needed and 
can be omitted by rule 6. It may conceivably be true in some sense, 
though nobody has succeeded in stating clearly what this sense is. But 
what is quite certain is that it is useless. 

† W. H. George, The Scientist in Action, 1936, p. 48.

Causality, as used in applied mathematics, has a more general form, 
such as: ‘Physical laws are expressible by mathematical equations, 
possibly connecting continuous variables, such that in any case, given 
a finite number of parameters, some variable or set of variables that 
appears in the equations is uniquely determined in terms of the others.’ 
This does not require that the values of the relevant parameters should 
be actually repeated; it is possible for an electrical engineer to predict 
the performance of a dynamo without there having already been some 
exactly similar dynamo. The equations, which we call laws, are inferred 
from previous instances and then applied to instances where the relevant 
quantities are different. This form permits astronomical prediction. 
But it still leaves the questions ‘How do we know that no other para- 
meters than those stated are needed?’, ‘How do we know that we need 
consider no variables as relevant other than those mentioned explicitly 
in the laws?’, and ‘Why do we believe the laws themselves?’ It is 
only after these questions have been answered that we can make any 
actual application of the principle, and the principle is useless until we 
have attended to the epistemological problems. Further, the principle 
happens to be false for quantitative observations. It is not true that 
observed results agree exactly with the predictions made by the laws 
actually used. The most that the laws do is to predict a variation that 
accounts for the greater part of the observed variation; it never accounts 
for the whole. The balance is called ‘error’ and usually quickly for- 
gotten or altogether disregarded in physical writings, but its existence 
compels us to say that the laws of applied mathematics do not express 
the whole of the variation. Their justification cannot be exact mathe- 
matical agreement, but only a partial one depending on what fraction 
of the observed variation in one quantity is accounted for by the 
variations of the others. The phenomenon of error is often dealt 
with by a suggestion of various minor variations that might alter the 
measurements, but this is no answer. An exact quantitative prediction 
could never be made, even if such a suggestion was true, unless we 
knew in each individual case the actual amounts of the minor varia- 
tions, and we never do. If we did we should allow for them and obtain 
a still closer agreement; but the fact remains that in practice, however 
fully we take small variations into account, we never get exact agree- 
ment. A physical law, for practical use, cannot be merely a statement 
of exact predictions; if it was it would invariably be wrong and would 
be rejected at the next trial. Quantitative prediction must always be 
prediction within a margin of uncertainty; the amount of this margin 
will be different in different cases, but for a law to be of any use it 
must state the margin explicitly. The outstanding variation, for prac- 
tical application, is as essential a part of the law as the predicted 
variation is, and a valid statement of the law must express it. But in 
any individual case this outstanding variation is not known. We know 
only something about its possible range of values, not what the actual 
value will be. Hence a physical law is not an exact prediction, but a state- 
ment of the relative probabilities of variations of different amounts. It is 
only in this form that we can avoid rejecting causality altogether as false 
or as inapplicable under rule 3; but a statement of ignorance of the individual 
errors has become an essential part of it, and we must recognize that the 
physical law itself, if it is to be of any use, must have an epistemological 
content. 
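
The point that a usable law must state its margin of uncertainty can be sketched numerically. The following is only an illustration under invented assumptions: the linear law, its slope `SLOPE`, and the error standard deviation `ERROR_SD` are hypothetical, and the Gaussian form of the outstanding variation is chosen for convenience, not taken from the text.

```python
import random
import statistics

random.seed(0)  # fixed seed so that the illustration is repeatable

SLOPE = 2.0      # hypothetical law: y = SLOPE * x
ERROR_SD = 0.5   # hypothetical standard deviation of the outstanding variation

xs = [0.5 * i for i in range(1, 41)]
# Simulated observations: the law's prediction plus an unknown individual error.
observed = [SLOPE * x + random.gauss(0.0, ERROR_SD) for x in xs]
predicted = [SLOPE * x for x in xs]
residuals = [o - p for o, p in zip(observed, predicted)]

# Fraction of the observed variation that the law accounts for.
total_variation = statistics.pvariance(observed)
outstanding = sum(r * r for r in residuals) / len(residuals)
fraction_accounted_for = 1.0 - outstanding / total_variation

# A law read as an exact prediction is judged by exact agreements;
# a law with a stated margin is judged by agreement within that margin.
exact_agreements = sum(1 for r in residuals if r == 0.0)
within_margin = sum(1 for r in residuals if abs(r) <= 2 * ERROR_SD)

print(f"fraction of variation accounted for: {fraction_accounted_for:.4f}")
print(f"exact agreements: {exact_agreements} of {len(xs)}")
print(f"within two standard deviations: {within_margin} of {len(xs)}")
```

Read as an exact prediction the hypothetical law fails at every trial; read as a statement about the probable size of the outstanding variation it accounts for nearly all of what is observed.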

The impossibility of exact prediction has recently been forced on the 
attention of physicists by Heisenberg’s Uncertainty Principle. It is 
remarkable, considering that the phenomenon of errors of observation 
was discussed by Laplace and Gauss, that there should still have been 
any physicists that thought that actual observations were exactly pre- 
dictable; yet attempts to evade the principle have shown that many 
exist. The principle is actually no new uncertainty. What Heisenberg 
has done is to consider the most refined types of observation that 
modern physics suggests might be possible, and to obtain a lower limit 
to the uncertainty; but it is much smaller than the old uncertainty, 
which was never neglected except by misplaced optimism. The exist- 
ence of errors of observation seems to have escaped the attention of 
many philosophers that have discussed the uncertainty principle; this 
is perhaps because they tend to get their notions of physics from popular 
writings, and not from works on the combination of observations. Their 
criticisms of popular physics, mostly valid as far as they go, would gain 
enormously in force if they attended to what we knew about errors 
before Heisenberg.† 

The word error is liable to be interpreted in some ethical sense, but 
its scientific meaning is closely connected with the original one. Latin 
errare, in its original sense, means to wander, not to sin or to make 
a mistake. The meaning occurs in ‘knight-errant’. The error means 
simply the outstanding variation after we have done our best to inter- 
pret the whole variation. 

† Professor L. S. Stebbing (Philosophy and the Physicists, 1938, p. 198) remarks: 
‘There can be no doubt at all that precise predictions concerning the behaviour of 
macroscopic bodies are made and are exactly verified within the limits of experimental 
error.’ Without the saving phrase at the end the statement is intelligible, and false. 
With it, it is meaningless. The severe criticism of much in modern physics contained 
in this book is, in my opinion, thoroughly justified, but the later parts lose much of 
their point through inattention to the problem of errors of observation. Some 
philosophers, however, have seen the point quite clearly. For instance, Professor J. H. 
Muirhead (The Elements of Ethics, 1910, pp. 37-8) states: ‘The truth is that what is 
called a natural law is itself not so much a statement of fact as of a standard or type 
to which facts have been found more or less to approximate. This is true even in 
inorganic nature.’ I am indebted to Mr. John Bradley for the reference. 

The criterion of universal assent, stated by Dr. N. R. Campbell and 
by Professor H. Dingle in his Science and Human Experience (but 
abandoned in his Through Science to Philosophy), must also be rejected 
by rule 3. This criterion requires general acceptance of a principle 
before it can be adopted. But it is impossible to ask everybody’s con- 
sent before one believes anything; and if ‘everybody’ is replaced by 
‘everybody qualified to judge’, we cannot apply the criterion until we 
know who is qualified, and even then it is liable to happen that only 
a small fraction of the people capable of expressing an opinion on a 
scientific paper read it at all, and few even of those do express any. 
Campbell lays much stress on a physicist’s characteristic intuition,† 
which apparently enables him always to guess right. But if there is 
any such intuition there is no need for the criterion of general agree- 
ment or for any other. The need for some general criterion is that even 
among those apparently qualified to judge there are often serious 
differences of opinion about the proper interpretation of the same 
facts; what we need is an impersonal criterion that will enable an 
individual to see whether, in any particular instance, he is following 
the rules that other people follow and that he himself follows in other 
instances. 

The chief constructive rule is 4. It declares that there is a valid 
primitive idea expressing the degree of confidence that we may reason- 
ably have in a proposition, even though we may not be able to give 
either a deductive proof or a disproof of it. In extreme cases it may 
be a mere statement of ignorance. We need to express its rules. One 
obvious one (though it is very commonly overlooked) is that it depends 
both on the proposition considered and on the data in relation to which 
it is considered. Suppose that I know that Smith is an Englishman, 
but otherwise know nothing particular about him. He is very likely, 
on that evidence, to have a blue right eye. But suppose that I am 
informed that his left eye is brown — the probability is changed com- 
pletely. This is a trivial case, but the principle in it constitutes most 
of our subject-matter. It is a fact that our degrees of confidence in 
a proposition habitually change when we make new observations or 
new evidence is communicated to us by somebody else, and this change 
constitutes the essential feature of all learning from experience. We 
must therefore be able to express it. Our fundamental idea will not be 
simply the probability of a proposition p, but the probability of p on 
data q. Omission to recognize that a probability is a function of two 
arguments, both propositions, is responsible for a large number of 
serious mistakes; in some hands it has led to correct results, but at the 
cost of omitting to state essential hypotheses and giving a delusive 
appearance of simplicity to what are really very difficult arguments. 
It is no more valid to speak of the probability of a proposition without 
stating the data than it would be to speak of the value of x+y for given x, 
irrespective of the value of y. 

† Aristot. Soc. Suppl. vol. 17, 1938, 122. 
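
The dependence of a probability on two arguments, the proposition and the data, can be sketched with the example of Smith's eyes. The counts below are invented purely to produce numbers, and the function `probability` is a hypothetical helper; nothing here is meant to define probability by frequencies, which the text rejects.

```python
from fractions import Fraction

# Invented weights for (right eye, left eye) colours among Englishmen.
# They are assumptions made for illustration only.
weights = {
    ("blue", "blue"): 60,
    ("brown", "brown"): 38,
    ("blue", "brown"): 1,
    ("brown", "blue"): 1,
}

def probability(proposition, data):
    """Number assigned to `proposition` on `data`; both are predicates on a case."""
    admissible = {case: w for case, w in weights.items() if data(case)}
    total = sum(admissible.values())
    if total == 0:
        raise ValueError("the data exclude every case considered")
    favourable = sum(w for case, w in admissible.items() if proposition(case))
    return Fraction(favourable, total)

def right_eye_blue(case):
    return case[0] == "blue"

def is_englishman(case):
    return True  # every case in the table already concerns an Englishman

def englishman_with_brown_left_eye(case):
    return is_englishman(case) and case[1] == "brown"

before = probability(right_eye_blue, is_englishman)
after = probability(right_eye_blue, englishman_with_brown_left_eye)
print(before, after)  # 61/100 1/39
```

The proposition is the same in both calls; only the data differ, and the number changes from 61/100 to 1/39. A function of the proposition alone could not express this.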

We can now proceed on rule 7. It is generally believed that proba- 
bilities are orderable: that is, that if p, q, r are three propositions, 
the statement ‘on data p, q is more probable than r’ has a meaning. 
In actual cases people may disagree about which is the more probable, 
and it is sometimes said that this implies that the statement has no 
meaning. But the differences may have other explanations; (1) The 
commonest is that the probabilities are on different data, one person 
having relevant information not available to the other, and we have 
made it an essential point that the probability depends on the data. 
The conclusion to draw in such a case is that, if people argue without 
telling each other what relevant information they have, they are wasting 
their time. (2) The estimates may be wrong. It is perfectly possible to 
get a wrong answer in pure mathematics, so that by rule 8 this is no 
objection. In this case, where the probability is often a mere guess, 
we cannot expect the answer to be right, though it may be and often 
is a fair approximation. (3) The wish may be father to the thought. 
But perhaps this also has an analogue in pure mathematics, if we con- 
sider the number of fallacious methods of squaring the circle and 
proving Fermat’s last theorem that have been given, merely because 
people wanted π to be an algebraic or rational number or the theorem 
to be true. In any case alternative hypotheses are open to the same 
objection, on the one hand, that they depend on a wish to have a wholly 
deductive system and to avoid the explicit statement of the fact that 
scientific inferences are not certain; or, on the other, that the statement 
that there is a most probable alternative on given data may curtail their 
freedom to believe another when they find it more pleasant. I think 
that these reasons account for all the apparent differences, but they 
are not fundamental. Even if people disagree about which is the more 
probable alternative, they agree that the comparison has a meaning. 
We shall assume that this is right. The meaning, however, is not a 
statement about the external world; it is a relation of inductive logic. 
Our primitive notion, then, is that of the relation ‘given p, q is more 
probable than r’, where p, q, and r are three propositions. If this is 
satisfied in a particular instance, we say that r is less probable than q, 
given p; this is the definition of less probable. If given p, q is neither 
more nor less probable than r, q and r are equally probable, given p. 
Then our first axiom is 

Axiom 1. Given p, q is either more, equally, or less probable than r, 
and no two of these alternatives can be true. 

This axiom may be called that of the comparability of probabilities. 
In Scientific Inference I took it in a more general form, assuming that 
the probabilities of propositions on different data can be compared. 
But this appears to be unnecessary, because it is found that the com- 
parability of probabilities on different data, whenever it arises in 
practice, is proved in the course of the work and needs no special axiom. 
The fundamental relation is transitive; we express this as follows. 

Axiom 2. If p, q, r, s are four propositions, and, given p, q is more 
probable than r and r is more probable than s, then, given p, q is more 
probable than s. 

The extreme degrees of probability are certainty and impossibility. 
These lead to 

Axiom 3. All propositions deducible from a proposition p have the same 
probability on data p; and all propositions inconsistent with p have the 
same probability on data p. 

We need this axiom to ensure consistency with deductive logic in 
cases that can be treated by both methods. We are trying to construct 
an extended logic, of which deductive logic will be a part, not to intro- 
duce an ambiguity in cases where deductive logic already gives definite 
answers. I shall often speak of ‘certainty on data p’ and ‘impossibility 
on data p’. These do not refer to the mental certainty of any particular 
individual, but to the relations of deductive logic expressed by ‘q is 
deducible from p’ and ‘not-q is deducible from p’. In G. E. Moore’s 
terminology, we may read the former as ‘p entails q’. In consequence 
of our rule 5, we shall never have ‘p entails q’ if p is merely the general 
rules of the theory and q is an empirical proposition. 

Actually I shall take ‘entails’ in a slightly extended sense; in some 
usages it would be held that p is not deducible from p, or from p and q 
together. Some shortening of the writing is achieved if we agree to 
define ‘p entails q’ as meaning either ‘q is deducible from p’ or ‘q is 
identical with p’ or ‘q is identical with some proposition asserted in p’. 
This avoids the need for special attention to trivial cases. 

We also need the following axiom. 

Axiom 4. If, given p, q and q' cannot both be true, and if, given p, 
r and r' cannot both be true, and if, given p, q and r are equally probable 
and q' and r' are equally probable, then, given p, ‘q or q'’ and ‘r or r'’ are 
equally probable. 

At this stage it is desirable for clearness to introduce the following 
notations and terminologies, mainly from Principia Mathematica. 
$\sim p$ means ‘not-p’; that is, p is false. 

$p.q$ means ‘p and q’; that is, p and q are both true. 

$p \vee q$ means ‘p or q’; that is, at least one of p and q is true. 

These notations may be combined, dots being used as brackets. Thus 
$\sim:p.q$ means ‘p and q is not true’; that is, at least one of p and q 
is false, which is equivalent to $\sim p.\vee.\sim q$. But 
$\sim p.q$ means ‘p is false and q is true’, which is not the same proposition. 
The rule is that a set of dots represents a bracket, the completion 
of the bracket being either the next equal set of dots or the 
end of the expression. Dots may be omitted in joint assertions where 
no ambiguity can arise. 
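
The effect of the dots can be checked by a truth table. The sketch below is an illustration only; the three function names are hypothetical labels for the three expressions just discussed.

```python
from itertools import product

def not_both(p, q):
    """~:p.q  -- the dots bracket 'p.q', so this is not-(p and q)."""
    return not (p and q)

def one_at_least_false(p, q):
    """~p.v.~q  -- (not p) or (not q)."""
    return (not p) or (not q)

def first_false_second_true(p, q):
    """~p.q  -- without the dots the negation applies to p alone."""
    return (not p) and q

cases = list(product([True, False], repeat=2))

# ~:p.q and ~p.v.~q agree for every assignment of truth-values.
equivalent = all(not_both(p, q) == one_at_least_false(p, q) for p, q in cases)

# ~:p.q and ~p.q do not: list the assignments on which they differ.
differing = [(p, q) for p, q in cases
             if not_both(p, q) != first_false_second_true(p, q)]

print(equivalent)   # True
print(differing)    # [(True, False), (False, False)]
```

The two expressions differ exactly when q is false, which is why the bracketing dots cannot be dropped here.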

The joint assertion or conjunction of p and q is the proposition $p.q$; 
and the joint assertion of p, q, r, s,... is the proposition $p.q.r.s\ldots$; that 
is, that p, q, r, s,... are all true. The joint assertion is also called the 
logical product. 

The disjunction of p and q is the proposition $p \vee q$; the disjunction 
of p, q, r, s is the proposition $p \vee q \vee r \vee s$, that is, at least one of p, q, r, s 
is true. The disjunction is also called the logical sum. 

A set of propositions $q_i$ ($i = 1$ to $n$) are said to be exclusive on data p 
if not more than one of them can be true on data p; that is, if p entails 
all the disjunctions $\sim q_i \vee \sim q_k$ when $i \neq k$. 

A set of propositions q, r, s are said to be exhaustive on data p if at 
least one of them must be true on data p; that is, if p entails the 
disjunction $q \vee r \vee s$. 

It is possible for a set of alternatives to be both exclusive and 
exhaustive. For instance, a finite class must have some number n; 
then the propositions n = 0, 1, 2, 3,... must include one true proposi- 
tion, but cannot contain more than one. 
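
The two definitions can be sketched as tests over the cases that the data leave open. This is a minimal sketch under stated assumptions: the data p are represented, hypothetically, by a finite list of admissible cases (here a class of at most five members), and `exclusive` and `exhaustive` are names introduced for the illustration.

```python
from itertools import combinations

def exclusive(propositions, cases):
    """True if no two propositions hold together in any admissible case."""
    return all(not (a(c) and b(c))
               for c in cases
               for a, b in combinations(propositions, 2))

def exhaustive(propositions, cases):
    """True if at least one proposition holds in every admissible case."""
    return all(any(prop(c) for prop in propositions) for c in cases)

# Assumed data p: the class has some number n of members, 0 <= n <= 5.
cases = list(range(6))

exactly = [lambda n, k=k: n == k for k in cases]         # n = 0, 1, ..., 5
at_least = [lambda n, k=k: n >= k for k in (0, 3)]       # n >= 0, n >= 3
two_only = [lambda n, k=k: n == k for k in (0, 1)]       # n = 0, n = 1

print(exclusive(exactly, cases), exhaustive(exactly, cases))      # True True
print(exclusive(at_least, cases), exhaustive(at_least, cases))    # False True
print(exclusive(two_only, cases), exhaustive(two_only, cases))    # True False
```

The first set is both exclusive and exhaustive, as in the text's example; the other two show that either property can hold without the other.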

Then Axiom 4 will read: 

If q and q' are exclusive, and r and r' are exclusive, on data p, and if, 
given p, q and r are equally probable and q' and r' are equally probable, 
then, given p, $q \vee q'$ and $r \vee r'$ are equally probable. 

An immediate extension, obtained by successive applications of this 
axiom, is: 

Theorem 1. If $q_1, q_2, \ldots, q_n$ are exclusive, and $r_1, r_2, \ldots, r_n$ are exclusive, 
on data p, and if, given p, the propositions $q_1$ and $r_1$, $q_2$ and $r_2$, ..., $q_n$ and $r_n$ 
are equally probable in pairs, then, given p, $q_1 \vee q_2 \ldots \vee q_n$ and $r_1 \vee r_2 \ldots \vee r_n$ 
are equally probable. 

It will be noticed that we have not yet assumed that probabilities 
can be expressed by numbers. I do not think that the introduction of 
numbers is strictly necessary to the further development; but it has the 
enormous advantage that it permits us to use mathematical technique. 
Without it, while we might obtain a set of propositions that would have 
the same meanings, their expression would be much more cumbrous. 
The actual introduction of numbers is done by conventions, the nature 
of which is essentially linguistic. 

Convention 1. We assign the larger number on given data to the more 
probable proposition (and therefore equal numbers to equally probable 
propositions). 

Convention 2. If, given p, q and q' are exclusive, then the number 
assigned on data p to ‘q or q'’ is the sum of those assigned to q and to q'. 

It is important to notice the meaning of a convention. It is neither 
an axiom nor a theorem. It is merely a rule introduced for convenience, 
and it has the property that other rules would give the same results. 
W. E. Johnson remarks that a convention is properly expressed in the 
imperative mood. An instance is the use of rectangular or polar coordi- 
nates in Euclidean geometry. The distance between two points is the 
fundamental idea, and all propositions can be stated as relations 
between distances. Any proposition in rectangular coordinates can be 
translated into polar coordinates, or vice versa, and both expressions 
would give the same results if translated into propositions about 
distances. It is purely a matter of convenience which we choose in a 
particular case. The choice of a unit is always a convention. But care 
is needed in introducing conventions; some postulate of consistency 
about the fundamental ideas is liable to be hidden. It is quite easy 
to define an equilateral right-angled plane triangle, but that does not 
make such a triangle possible. In this case Convention 1 specifies what 
order the numbers are to be arranged in. Numbers can be arranged in 
an order, and so can probabilities, by Axioms 1 and 2. The relation 
‘greater than’ between numbers is transitive, and so is the relation 
‘more probable than’ between propositions on the same data. There- 
fore it is possible to assign numbers by Convention 1, so that the order 
of increasing degrees of belief will be the order of increasing number. 
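
The working of the two conventions can be sketched on a small case. Everything specific below is an assumption made for illustration: six exclusive propositions taken as equally probable on the data, the value 1/6 for each, and the helper name `assigned`.

```python
from fractions import Fraction

# Assumed: six exclusive propositions, equally probable on the data p,
# so Convention 1 gives them equal numbers. The value 1/6 is a choice of scale.
number = {face: Fraction(1, 6) for face in range(1, 7)}

def assigned(alternatives):
    """Convention 2: a disjunction of exclusive propositions gets the sum."""
    return sum((number[a] for a in set(alternatives)), Fraction(0))

even = assigned({2, 4, 6})
odd = assigned({1, 3, 5})
low = assigned({1, 2})

print(even, odd, low)   # 1/2 1/2 1/3

# Disjunctions built from equally probable pairs receive equal numbers,
# as Axiom 4 and Theorem 1 require.
print(even == odd)      # True

# Convention 1 read backwards: the larger number marks the more probable.
print(even > low)       # True

# A different scale (percentages) is another convention with the same order.
percent = {name: 100 * value
           for name, value in {"even": even, "odd": odd, "low": low}.items()}
print(sorted(percent, key=percent.get))   # ['low', 'even', 'odd']
```

Changing the scale changes every number but not the order, which is the sense in which the choice is a convention and not an axiom.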



So far we need no new axiom; but we shall need the axiom that there 
are enough numbers for our purpose. 

Axiom 5. The set of possible probabilities on given data, ordered in 
terms of the relation ‘more probable than’, can be put into one-one 
correspondence with a set of real numbers in increasing order. 

The need for such an axiom was pointed out by an American reviewer
of Scientific Inference. He remarked that if we take a series of number
pairs u_n = (a_n, b_n) and make it a rule that u_r is to be placed after
u_s if a_r > a_s, but that if a_r = a_s, u_r is to be placed after u_s
if b_r > b_s, then the axiom that the u_n can be placed in an order will
hold, but if a_n and b_n can each take a continuous series of values it
will be impossible to establish a one-one correspondence between the
pairs and a single continuous series without deranging the order.

Convention 2 and Axiom 4 will imply that, if we have two pairs of 
exclusive propositions with the same probabilities on the same data, 
the numbers chosen to correspond to their disjunctions will be the 
same. The extension to disjunctions of several propositions is justi- 
fied by Theorem 1. We shall always, on given data, associate the 
same numbers with propositions entailed or contradicted by the data; 
this is justified by Axiom 3. The assessment of numbers in the way 
suggested is therefore consistent with our axioms. We can now introduce
the formal notation

P(q | p)

for the number associated with the probability of the proposition q on
data p; it may be read ‘the probability of q given p’ provided that we
remember that the number is not in fact the probability, but merely 
a representation of it in terms of a pair of conventions. The probability, 
strictly, is the reasonable degree of confidence and is not identical with 
the number used to express it. The relation is that between Mr. Smith 
and his name ‘Mr. Smith’. A sentence containing the words ‘Mr. Smith’ 
may correspond to, and identify, a fact about Mr. Smith. But Mr. 
Smith himself does not occur in the sentence.† In this notation, the
properties of numbers will now replace Axiom 1; Axiom 2 is restated
‘if P(q | p) > P(r | p), and P(r | p) > P(s | p), then P(q | p) > P(s | p)’,
which is a mere mathematical implication, since all the expressions are 
numbers. Axiom 3 will require us to decide what numbers to associate 
with certainty and impossibility. We have 

Theorem 2. If p is consistent with the general rules, and p entails ~q,
then P(q | p) = 0.

† Cf. R. Carnap, The Logical Syntax of Language.




For let q and r be any two propositions, both impossible on data p.
Then (Ax. 3) if a is the number associated with impossibility on data p,

P(q | p) = P(r | p) = P(q v r | p) = a

since q, r, and q v r are all impossible propositions on data p and must
be associated with the same number. But qr is impossible on data p;
hence, by definition, q and r are exclusive on data p, and (Conv. 2)

P(q v r | p) = P(q | p) + P(r | p) = 2a;

whence a = 0. Therefore all probability numbers are ≥ 0, by
Convention 1.

As we have not assumed the comparability of probabilities on different
data, attention is needed to the possible forms that can be
substituted for q and r, given p. If p is a purely a priori proposition,
it can never entail an empirical one. Hence, if p stands for our general
rules, the admissible values for q and r must be false a priori
propositions, such as 2 = 1 and 3 = 2. Since such propositions can be stated
the theorem follows. If p is empirical, then ~p is an admissible value
for both q and r. Or, since we are maintaining the same general
principles throughout, we may remember that in practice if p is empirical
and we denote our general principles by h, then any set of data that
actually occurs and includes an empirical proposition will be of the
form ph. Then for q and r we may still substitute false a priori
propositions, which will be impossible on data ph. Hence it is always
possible to assign q and r so as to satisfy the conditions stated in the
proof.

Convention 3. If p entails q, then P(q | p) = 1.

This is the rule generally adopted; but there are cases where we wish 
to express ignorance over an infinite range of values of a quantity, and 
it is then convenient to express certainty that the quantity lies in that
range by ∞, in order to keep ratios for finite ranges determinate. None
of our axioms so far has stated that we must always express certainty 
by the same number on different data, merely that we must on the 
same data; but with this exception it is convenient to do so. 

The converse of Theorem 2 would be: ‘If P(q | p) = 0, then p entails
~q.’ This is false if we use Convention 3. For instance, a continuous
variable may be equally likely to have any value between 0 and 1.
Then the probability that it is exactly 1/2 is 0, but 1/2 is not an
impossible value. There would be no point in making certainty correspond
to infinity in such a case, for it would make the probability infinite
for any finite range. It turns out that we have no occasion to use the
converse of Theorem 2.

Axiom 6. If pq entails r, then P(qr | p) = P(q | p).

In other words, given p throughout, we may consider whether q is 
false or true. If q is false, then qr is false. If q is true, then, since pq 
entails r, r is also true and therefore qr is true. Similarly, if qr is true 
it entails q, and if qr is false q must be false on data p, since if it was 
true qr would be true. Thus it is impossible, given p, that either q or 
qr should be true without the other. This is an extension of Axiom 3 
and is necessary to enable us to take over a further set of rules sug- 
gested by deductive logic, and to say that all equivalent propositions 
have the same probability on given data. 

Theorem 3. If q and r are equivalent in the sense that each entails the 
other, then each entails qr, and the probabilities of q and r on any data must 
be equal. Similarly, if pq entails r, and pr entails q, P(q | p) = P(r | p),
since both are equal to P(qr | p).

An immediate corollary is 

Theorem 4. P(q | p) = P(qr | p) + P(q.~r | p).

For qr and q.~r are exclusive, and the sum of their probabilities on
any data is the probability of qr.v.q.~r (Conv. 2). But q entails
this proposition, and also, if either q and r are both true or q is true
and r false, q is true in any case. Hence the propositions q and
qr.v.q.~r are equivalent, and the theorem follows by Theorem 3.

It follows further that P(q | p) ≥ P(qr | p), since P(q.~r | p) cannot
be negative. Also, if we write q v r for q, we have

P(q v r | p) = P(q v r: r | p) + P(q v r: ~r | p) (Th. 4)

and q v r: r is equivalent to r, and q v r: ~r to q.~r. Hence

P(q v r | p) ≥ P(r | p).

Theorem 5. If q and r are two propositions, not necessarily exclusive
on data p,

P(q | p) + P(r | p) = P(q v r | p) + P(qr | p).

For the propositions qr, q.~r, ~q.r, ~q.~r are exclusive; and
q is equivalent to the disjunction of qr and q.~r, and r to the
disjunction of qr and ~q.r. Hence the left side of the equation is equal to

2P(qr | p) + P(q.~r | p) + P(~q.r | p) (Th. 4).

Also q v r is equivalent to the disjunction of qr, q.~r, and ~q.r.
Hence

P(q v r | p) = P(qr | p) + P(q.~r | p) + P(~q.r | p) (Th. 4),

whence the theorem follows.

It follows that, whether q and r are exclusive or not,

P(q v r | p) ≤ P(q | p) + P(r | p),

since P(qr | p) cannot be negative. Theorems 4 and 5 together express
upper and lower bounds to the possible values of P(q v r | p) irrespective
of exclusiveness. It cannot be less than either P(q | p) or P(r | p); it
cannot be more than P(q | p) + P(r | p).
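
As an illustration only, Theorems 4 and 5 and the resulting bounds can be checked on a small finite model. The four weights below are hypothetical numbers, chosen merely so that q and r are neither exclusive nor independent; certainty is taken as 1 (Convention 3), and the names `WEIGHTS` and `prob` are assumptions of this sketch, not notation from the text.

```python
from fractions import Fraction

# Hypothetical numbers attached to the four exclusive cases (q, r) on data p.
WEIGHTS = {
    (True, True): Fraction(1, 10),    # qr
    (True, False): Fraction(3, 10),   # q.~r
    (False, True): Fraction(2, 10),   # ~q.r
    (False, False): Fraction(4, 10),  # ~q.~r
}

def prob(event):
    """Sum the numbers of the exclusive cases in which `event` holds (Conv. 2)."""
    return sum(w for case, w in WEIGHTS.items() if event(*case))

p_q = prob(lambda q, r: q)
p_r = prob(lambda q, r: r)
p_qr = prob(lambda q, r: q and r)
p_q_not_r = prob(lambda q, r: q and not r)
p_q_or_r = prob(lambda q, r: q or r)

# Theorem 4: P(q | p) = P(qr | p) + P(q.~r | p)
theorem4_holds = p_q == p_qr + p_q_not_r
# Theorem 5: P(q | p) + P(r | p) = P(q v r | p) + P(qr | p)
theorem5_holds = p_q + p_r == p_q_or_r + p_qr
# Bounds: neither less than P(q | p) or P(r | p), nor more than their sum.
bounds_hold = max(p_q, p_r) <= p_q_or_r <= p_q + p_r
```

Exact fractions are used so that the equalities are tested without rounding error.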

Theorem 6. If q_1, q_2,... are a set of equally probable and exclusive
alternatives on data p, and if Q and R are disjunctions of two subsets of
these alternatives, of numbers m and n, then P(Q | p)/P(R | p) = m/n.

For if a is any one of the equal numbers P(q_1 | p), P(q_2 | p),... we
have, by Convention 2,

P(Q | p) = ma; P(R | p) = na;

whence the theorem follows.


Theorem 7. In the conditions of Theorem 6, if q_1, q_2,..., q_n are
exhaustive on data p, and R denotes their disjunction, then R is entailed
by p and P(R | p) = 1 (Conv. 3).

It follows that P(Q | p) = m/n.

This is virtually Laplace’s rule, stated at the opening of the Théorie
Analytique. R entails itself and therefore is a possible value of p; hence

P(Q | R) = m/n.

This may be read: given that a set of alternatives are equally probable,
exclusive, and exhaustive, the probability that some one of any subset is 
true is the ratio of the number in that subset to the whole number of possible 
cases. This form depends on Convention 3, and must be used only in 
cases where that convention is adopted. Theorem 6, however, is inde- 
pendent of Convention 3. If we chose to express certainty on data p 
by 2 instead of 1, the only change would be that all numbers associated
with probabilities on data p would be multiplied by 2, and Theorem 6 
would still hold. Theorem 6 is also consistent with the possibility that 
the number of alternatives is infinite, since it requires only that Q and 
R shall be finite subsets. But in this case the number associated with 
the probability of any infinite subset may be infinite and Convention 3 
is then unsuitable. 
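
The dependence of Theorem 7 on Convention 3, and the independence of Theorem 6 from it, can be sketched numerically. The die example and the function name `subset_probability` are assumptions introduced for illustration; `certainty` stands for whatever number is chosen to express certainty on data p.

```python
from fractions import Fraction

def subset_probability(m, n_total, certainty=Fraction(1)):
    """Number for a disjunction of m out of n_total equally probable,
    exclusive and exhaustive alternatives, when the whole set is given
    the number `certainty`."""
    a = certainty / n_total   # number attached to each single alternative
    return m * a              # addition rule for exclusive alternatives (Conv. 2)

# Hypothetical example: a six-sided die; Q = 'even' (3 faces), R = '1 or 2' (2 faces).
ratio_unit = subset_probability(3, 6) / subset_probability(2, 6)
ratio_double = (subset_probability(3, 6, certainty=Fraction(2))
                / subset_probability(2, 6, certainty=Fraction(2)))

# Theorem 7 (needs Convention 3): P(Q | p) = m/n with n the whole number of cases.
p_even = subset_probability(3, 6)
```

Doubling the number used for certainty doubles every assessment but leaves the ratio of Theorem 6 unchanged, which is the point made in the text.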





Theorems 6 and 7 tell us how to assess the ratios of probabilities, 
and, subject to Convention 3, the actual values, provided that the 
propositions considered can be expressed as finite subsets of equally 
probable, exclusive, and, for Theorem 7, exhaustive alternatives on the 
data. Such assessments will always be rational fractions, and may be
called R-probabilities. Now a statement that m and n cannot exceed
some given value would be an empirical proposition asserted a priori,
and would be inadmissible on rule 5. Hence the R-probabilities possible
within the formal scheme form a set of the ordinal type of the rational
fractions.

If all probabilities were R-probabilities there would be no need for
Axiom 5, and the converse of Theorem 2 could hold. But many
propositions that we shall have to consider are of the form that a
magnitude, capable of a continuous range of values, lies within a specified
part of that range, and we may be unable to express them in the
required form. Thus there is no need for all probabilities to be
R-probabilities. However, if a proposition is not expressible in the
required form, it will still be associated with a reasonable degree of
belief by Axiom 1, and this, by Axiom 2, will separate the degrees for
R-probabilities into two segments, according to the relations ‘more
probable than’ and ‘less probable than’. The corresponding numbers,
the R-probabilities themselves, will be separated by a unique real
number, by Axiom 5 and an application of Dedekind’s section. We
take the numerical assessment of the probability of a proposition not
expressible in the form required by Theorems 6 and 7 to be this number.
Hence we have

Theorem 8. Any probability can be expressed by a real number. 

If x is a variable capable of a continuous set of values, we may
consider the probability on data p that x is less than x_0, say

P(x < x_0 | p) = f(x_0).

If f(x_0) is differentiable we shall then be able to write

P(x_0 < x < x_0 + dx_0 | p) = f'(x_0)dx_0 + o(dx_0).

We shall usually write this briefly P(dx | p) = f'(x)dx, dx on the left
meaning the proposition that x lies in a particular range dx. f'(x) is
called the probability density.
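
A small numerical sketch may make the role of the o(dx_0) term concrete. The function f(x_0) = x_0^2 on the range 0 to 1 is a hypothetical choice made only for this illustration, and the names `f`, `density` and `remainder_ratio` are likewise assumptions.

```python
def f(x0):
    """Hypothetical P(x < x0 | p) for a variable confined to 0 <= x <= 1."""
    return x0 * x0

def density(x0):
    """Derivative f'(x0) of the hypothetical f above."""
    return 2.0 * x0

def remainder_ratio(x0, dx):
    """(P(x0 < x < x0 + dx | p) - f'(x0) dx) / dx, which should tend to 0 with dx."""
    interval_probability = f(x0 + dx) - f(x0)
    return (interval_probability - density(x0) * dx) / dx

# For this f the remainder is exactly dx**2, so the ratio equals dx.
ratios = [remainder_ratio(0.3, dx) for dx in (1e-1, 1e-2, 1e-3)]
```

The ratios shrink in proportion to dx, which is what the notation o(dx_0) asserts.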


Theorem 9. If Q is the disjunction of a set of exclusive alternatives
on data p, and if R and S are subsets of Q (possibly overlapping) and if
the alternatives in Q are all equally probable on data p and also on data
Rp, then

P(RS | p) = P(R | p)P(S | Rp)/P(R | Rp).

For suppose that the propositions contained in Q are of number n,
that the subset R contains m of them, and that the part common to
R and S contains l of them. Put

P(Q | p) = a.

Then, by Theorem 6,

P(R | p) = ma/n; P(RS | p) = la/n.

P(S | Rp) is the probability that the true proposition is in the S subset
given that it is in the R subset and p, and therefore is equal to
(l/m)P(R | Rp). Also RSp entails R; hence

P(S | Rp) = P(SR | Rp) (Ax. 6)

and

P(RS | p) = (l/m)(ma/n) = P(R | p)P(S | Rp)/P(R | Rp).

This is the first proposition that we have had that involves probabilities
on different data, two of the factors being on data p and two on data Rp.
Q itself does not appear in it and is therefore irrelevant. It is introduced
into the theorem merely to avoid the use of Convention 3. It might be
identical with any finite set that includes both R and S.

The proof has assumed that the alternatives considered are equally 
probable both on data p and also on data Rp. It has not been found 
possible to prove the theorem without using this condition. But it is 
necessary to further developments of the theory that we shall have 
some way of relating probabilities on different data, and Theorem 9 
suggests the simplest general rule that they can follow if there is one 
at all. We therefore take the more general form as an axiom, as follows. 

Axiom 7. For any propositions p, q, r,

P(qr | p) = P(q | p)P(r | qp)/P(q | qp).

If we use Convention 3 on data qp (not necessarily on data p),
P(q | qp) = 1, and we have W. E. Johnson’s form of the product rule,
which can be read: the probability of the joint assertion of two propositions
on any data p is the product of the probability of one of them on data p
and that of the other on the first and p. 
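
The counting argument behind Theorem 9 can be sketched for a finite set of equally probable alternatives. The sets Q, R and S below are hypothetical, and certainty is taken as 1 on every set of data used, so that the factor P(R | Rp) drops out as in Johnson's form.

```python
from fractions import Fraction

# Hypothetical example: Q holds n = 12 equally probable, exclusive alternatives.
Q = set(range(1, 13))
R = {k for k in Q if k % 2 == 0}   # m = 6 alternatives
S = {k for k in Q if k % 3 == 0}   # overlaps R in {6, 12}, so l = 2

def prob(subset, given):
    """P(subset | given) by counting equally probable alternatives,
    with certainty on the data `given` expressed by 1."""
    return Fraction(len(subset & given), len(given))

lhs = prob(R & S, Q)              # P(RS | p) = l/n
rhs = prob(R, Q) * prob(S, R)     # P(R | p) P(S | Rp) = (m/n)(l/m)
```

Here both sides reduce to l/n, the value obtained in the proof.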

We notice that the probability of the logical sum follows the addition 
rule (with a caveat), that of the logical product the product rule. This 
parallel between the Principia and probability language is lost when the 
joint assertion is called the sum, as has occurred in some recent writings. 





In a sense a probability can be regarded as a logical quotient, since in the
conditions of Theorem 7 the probability of Q given R is the probability
of Q given p divided by that of R given p. This has been recognized
in the history of the notation, which Keynes† traces to H. McColl.
McColl wrote the probability of a, relative to the a priori premiss h,
as a/ε, and relative to bh as a/b. This was modified by W. E. Johnson
to a/h and a/bh, and he is followed by Keynes, Broad, and Ramsey.
Wrinch and I found that this notation was inconvenient when the
solidus may have to be used in its usual mathematical sense in the
same equation, and introduced P(p:q), which I modified further to
P(p | q) in Scientific Inference, because the colon was beginning to be
needed in the Principia sense of a bracket.

The sum of two classes α and β, in Principia, is the class γ such that
every member of α or of β is in γ, and conversely. The product class of
α and β is the class δ of members common to α and β. Thus Theorem 5
has a simple analogy with the numbers of members of the classes α and
β, γ and δ. The multiplicative class of α and β is the class of all pairs, one
from α and one from β; it is this class, not the product class, that gives
an interpretation to the product of the numbers of members of α and β.

The extension of the product rule from Theorem 9 to Axiom 7 has 
been taken as axiomatic. This is an application of a principle repeatedly 
adopted in Principia Mathematica. If there is a choice between possible 
axioms, we take the one that enables most consequences to be drawn. 
Such a generalization is not inductive. What we are doing is to seek for 
a set of axioms that will permit the construction of a theory of induction, 
the axioms themselves being primitive postulates. The choice is limited 
by rule 6; the axioms must be reduced to the minimum number, and 
the check on whether we make them too general will be provided by 
rule 2, which will reject a theory if it is found to lead to contradictory 
consequences. Consider then whether the rule 

P(qr | p) = P(q | p)P(r | qp)

can hold in general. Suppose first that p entails ~(qr); then either p
entails ~q, or p and q together entail ~r. In either case both sides of
the equation vanish and the rule holds. Secondly, suppose that p entails
qr; then p entails q and pq entails r. Thus both sides of the equation
are 1. Similarly, we have consistency in the converse cases where p
entails ~q, or pq entails ~r, or p entails q and pq entails r. This
covers the extreme cases.

† Treatise on Probability, 1921, p. 156. This book is full of interesting historical data
and contains many important critical remarks. It is not very successful on the
constructive side, since an unwillingness to generalize the axioms has prevented Keynes
from obtaining many important results.

If there are any cases where the rule is untrue, we shall have to say 
that in such cases P(qr | p) depends on something besides P(q | p) and
P(r | qp), and a new hypothesis would be needed to deal with such cases.
By rule 6, we must not introduce any such hypothesis unless need for 
it is definitely shown. The product rule may therefore be taken as 
general unless it can be shown to lead to contradictions. We shall see 
(p. 35) that consistency can be proved in a wide class of cases. 

1.21. The product rule is often misread as follows: the joint
probability of two propositions is the product of their probabilities separately.
This is meaningless as it stands because the data relative to which the
probabilities are considered are not mentioned. In actual application,
the rule so stated is liable to become: the joint probability of two
propositions on given data is the product of their separate probabilities
on those data. This is false. We may see this by considering extreme
cases. The correct statement of the rule may be written (using
Convention 3 on data pr)

P(pq | r) = P(p | r)P(q | pr) (1)

and the other one as

P(pq | r) = P(p | r)P(q | r). (2)

If p cannot be true given r, then p and q cannot both be true, and both
(1) and (2) reduce to 0 = 0. If p is certain given r, both reduce to

P(q | r) = P(q | r) (3)

since in (1) the inclusion of p in the data tells us nothing about q that
is not already told us by r. If q is impossible given r, both reduce to
0 = 0. If q is certain given r, both reduce to

P(p | r) = P(p | r). (4)

So far everything is satisfactory. But suppose that q is impossible
given pr. Then it is impossible for pq to be true given r, and (1)
reduces correctly to 0 = 0. But (2) reduces to

0 = P(p | r)P(q | r),

which is false; it is perfectly possible for both p and q to be consistent
with r and pq to be inconsistent with r. Consider the following. Let
r consist of the following information: in a given population all the
members have eyes of the same colour; half of them have blue eyes and
half brown; one member is to be chosen, and any member is equally
likely to be selected. p is the proposition that his left eye is blue, q the
proposition that his right eye is brown. What is the probability, on
data r, that his left eye is blue and his right brown? P(p | r) and
P(q | r) are both 1/2, and according to (2) P(pq | r) = 1/4. But according
to (1) the probability that his right eye is brown must be assessed
subject both to the information that his eyes are of the same colour
and that his left eye is blue, and this probability is 0. Thus (1) gives
P(pq | r) = 0. Clearly the latter result is right; further applications of
the former, considering also ~p (left eye brown) and ~q (right eye
blue) lead to the astonishing result that on data including the
proposition that all members have two eyes of the same colour, it is as
likely as not that any member will have eyes of different colours. 
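
The eye-colour example can be checked by direct enumeration. The sketch below assumes a population represented by two equally likely kinds of member, each with both eyes of one colour; the helper `prob` and the variable names are introduced only for illustration.

```python
from fractions import Fraction

# Data r: each member has both eyes of one colour; half blue, half brown.
# Each outcome is (left eye, right eye); the two outcomes are equally likely.
OUTCOMES = [("blue", "blue"), ("brown", "brown")]

def prob(event, given=lambda o: True):
    """P(event | given, r) by counting equally likely outcomes."""
    pool = [o for o in OUTCOMES if given(o)]
    return Fraction(sum(1 for o in pool if event(o)), len(pool))

def p(o):
    return o[0] == "blue"    # left eye blue

def q(o):
    return o[1] == "brown"   # right eye brown

def pq(o):
    return p(o) and q(o)

rule_1 = prob(p) * prob(q, given=p)   # P(p | r) P(q | pr), the correct rule (1)
rule_2 = prob(p) * prob(q)            # P(p | r) P(q | r), the misread rule (2)
direct = prob(pq)                     # P(pq | r) found by counting
```

Rule (1) agrees with the direct count, while rule (2) assigns 1/4 to a proposition that the data exclude.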

This trivial instance is enough to dispose of (2); but (2) has been 
widely applied in cases where it gives wrong results, and sometimes 
seriously wrong ones. The Boltzmann H-theorem of the kinetic theory
of gases rests on a fallacious application of it, since it considers an 
assembly of molecules, possibly with differences of density from place 
to place, and gives the joint probability that two molecules will be in 
adjoining regions as the product of the separate probabilities that they 
will be. If there are differences of density, and one molecule is in a 
region chosen at random, that is some evidence that the region is one of 
high density; then the probability that a second is in the region, given
that the first is, is somewhat higher than it would be in the absence of 
information about the first. Similar considerations apply to Boltz- 
mann’s treatment of the velocities. In this case the mistake has not 
prevented the right result from being obtained, though it does not 
follow from the hypotheses. 

Nevertheless there are many cases where (2) is true. If 

P(q | pr) = P(q | r)

we say that p is irrelevant to q, given r. 


1.22. Theorem 10. If q_1, q_2,..., q_n are a set of alternatives, H the
information already available, and p some additional information, then
the ratio

P(q_r | pH)P(q_r | q_r H) / {P(q_r | H)P(p | q_r H)}

is the same for all the q_r.

By Axiom 7

P(pq_r | H) = P(p | H)P(q_r | pH)/P(p | pH) (1)

= P(q_r | H)P(p | q_r H)/P(q_r | q_r H), (2)

whence

P(q_r | pH)P(q_r | q_r H) / {P(q_r | H)P(p | q_r H)} = P(p | pH)/P(p | H) (3)

which is independent of q_r.

If we use unity to denote certainty on data q_r H for all the q_r,
(3) becomes

P(q_r | pH) ∝ P(q_r | H)P(p | q_r H) (4)

for variations of q_r. This is the principle of inverse probability, first
given by Bayes in 1763. It is the chief rule involved in the process of
learning from experience. It may also be stated, by means of the product
rule, as follows:

P(q_r | pH) ∝ P(pq_r | H). (5)


This is the form used by Laplace, by way of the statement that the
posterior probabilities of causes are proportional to the probabilities
a priori of obtaining the data by way of those causes. In the form
(4), if p is a description of a set of observations and the q_r a set of
hypotheses, the factor P(q_r | H) may be called the prior probability,
P(q_r | pH) the posterior probability, and P(p | q_r H) the likelihood, a
convenient term introduced by Professor R. A. Fisher, though in his
usage it is sometimes multiplied by a constant factor. It is the
probability of the observations given the original information and the
hypothesis under discussion. The term a priori probability is sometimes
used for the prior probability, but this term has been used in so many
senses that the only solution is to abandon it. To Laplace the a priori
probability meant P(pq_r | H), and sometimes the term has even been
used for the likelihood. A priori has a definite meaning in logic, in
relation to propositions independent of experience, and we frequently
have need to use it in this sense. We may then state the principle of
inverse probability in the form: The posterior probabilities of the
hypotheses are proportional to the products of the prior probabilities and the
likelihoods. The constant factor will usually be fixed by the condition
that one of the propositions q_1 to q_n must be true, and the posterior
probabilities must therefore add up to 1. (If 1 is not suitable to denote
certainty on data pH, no finite set of alternatives will contain a finite
fraction of the probability. The rule covers all cases when there is
anything to say.)

The use of the principle is easily seen in general terms. If there is 
originally no ground to believe one of a set of alternatives rather than 
another, the prior probabilities are equal. The most probable, when 
evidence is available, will then be the one that was most likely to lead 
to that evidence. We shall be most ready to accept the hypothesis that
requires the fact that the observations have occurred to be the least
remarkable coincidence. On the other hand, if the data were equally
likely to occur on any of the hypotheses, they tell us nothing new with
respect to their credibility, and we shall retain our previous opinion, 
whatever it was. The principle will deal with more complicated circum- 
stances also; the immediate point is that it does provide us with what we 
want, a formal rule in general accordance with common sense, that will 
guide us in our use of experience to decide between hypotheses. 
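
As a minimal sketch of the principle of inverse probability for a finite set of hypotheses, the function below multiplies priors by likelihoods and fixes the constant factor by making the posterior probabilities add up to 1. The function name `posterior` and the numerical values are hypothetical; the sketch assumes at least one product is non-zero.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Posterior probabilities proportional to prior x likelihood,
    normalized so that they add up to 1."""
    products = [prior * like for prior, like in zip(priors, likelihoods)]
    total = sum(products)
    return [x / total for x in products]

third = Fraction(1, 3)

# Equal priors: the most probable hypothesis is the one most likely to
# have led to the evidence.
equal_prior_case = posterior(
    [third, third, third],
    [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)],
)

# Equal likelihoods: the data tell us nothing new and the priors are retained.
unequal_priors = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
equal_likelihood_case = posterior(unequal_priors, [Fraction(1, 5)] * 3)
```

The two cases correspond to the two general remarks made in the preceding paragraph.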

1.23. We have not yet shown that Convention 2 is a convention and 
not a postulate. This must be done by considering other possible conven- 
tions and seeing what results they lead to. Any other convention must 
not contradict Axiom 4. For instance, if the number associated with a 
probability by our rules is x, we might agree instead to use the number 
e^x. Then if x and x' are the present estimates for the propositions q
and q', and for r and r', those for q v q' and r v r' will both be e^(x+x') and
the consistency rule of Axiom 4 will be satisfied. But instead of the 
addition rule for the number to be associated with a disjunction we 
shall have a product rule. Every proposition stated in either notation 
can be translated into the other; if our present system leads to the result 
that a hypothesis is as likely to be true as it is that we should pick a 
white ball at random out of a bag containing 99 white ones and 1 black 
one, that result will also be obtained on the suggested alternative system. 
The fundamental notion is that of the comparison of reasonable degrees 
of belief, and so long as all methods place them in the same order the 
differences between the methods are conventional. This will be satisfied 
if instead of the number x we choose any function of it, f(x), such that 
x and f(x) are increasing functions of each other, so that for any value
of one the other is determinate. This is necessary by Convention 1
and Axiom 1, but every form of f(x) will lead to a different rule for the
probability-number of a disjunction if it is to be consistent with 
Axiom 4. Hence the addition rule is a convention. It is, of course, 
much the easiest convention to use. To abandon Convention 1,
consistently with Axiom 1, would merely arrange all numerical assessments
in the opposite order, and again the same results would be obtained in 
translation. The assessment by numbers is simply a choice of the 
most convenient language for our purposes. 
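
The alternative convention mentioned above, in which the number x is replaced by e^x, can be sketched as follows. The values attached to the propositions are hypothetical, and the names `to_alt` and `from_alt` are assumptions of this illustration.

```python
import math

def to_alt(x):
    """Translate a number on the present system into the alternative e**x."""
    return math.exp(x)

def from_alt(y):
    """Translate back from the alternative system to the present one."""
    return math.log(y)

# Hypothetical numbers for two exclusive propositions q and q' on given data.
x_q, x_q2 = 0.2, 0.5

sum_rule = x_q + x_q2                        # present system: addition rule
disjunction_alt = to_alt(x_q) * to_alt(x_q2)  # alternative system: product rule

# Both systems place any set of assessments in the same order.
xs = [0.05, 0.7, 0.2, 0.5]
order_present = sorted(range(len(xs)), key=lambda i: xs[i])
order_alt = sorted(range(len(xs)), key=lambda i: to_alt(xs[i]))
order_preserved = order_present == order_alt
```

Translating the product back with `from_alt` recovers the sum, so the two systems differ only in language.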

1.3. The original development of the theory, by Bayes,† proceeds
differently.

† Phil. Trans. 53, 1763, 376-98.

The foregoing account is entirely in terms of rules for the
comparison of reasonable degrees of belief. Bayes, however, takes as
his fundamental idea that of expectation of benefit. This is partly a 
matter of what we want, which is a separate problem from that of what 
it is reasonable to believe; I have therefore thought it best to proceed 
as far as possible in terms of the latter alone. Nevertheless, we have in 
practice often to make decisions that involve not only belief but the 
desirability of the possible effect of different courses of action. If we
have to give advice to a practical man, either we or he must take these 
into account. In deciding on his course of action he must allow both 
for the probability that the action chosen will lead to a certain result 
and for the value to him of that result if it happens. The fullest 
development on these lines is that of F. P. Ramsey.† I shall not
attempt to reproduce it, but shall try to indicate some of the principal 
points as they occur in his work or in Bayes’s. The fundamental idea 
is that the values of expectations of benefit can be arranged in an order; 
it is legitimate to compare a small probability of a large gain with
a large probability of a small gain. The idea is necessarily more com- 
plicated than my Axiom 1 ; on the other hand, the comparison is one 
that a business man often has to make, whether he wants to or not, or 
whether it is legitimate or not. The rule simply says that in given 
circumstances there is always a best way to act. The comparison of 
probabilities follows at once; if the benefits are the same, whichever 
of two events happens, then if the values to us of the expectations of 
benefit differ it is because the events are not equally likely to happen, 
and the larger value is associated with the larger probability. Now we 
have to consider the combination of expectations. Here Bayes, I think, 
overlooks the distinction between what Laplace calls ‘mathematical’ 
and ‘moral’ expectation. Bayes speaks in terms of monetary stakes, 
and would say that a 1/100 chance of receiving £100 is as valuable as 
a certainty of receiving £1. A gambler might say that it is more valuable; 
most people would perhaps say that it is less so. Indeed Bayes’s 
definition of a probability of 1/100 would be that it is the probability 
such that the value of the chance of receiving £100 is the same as 
the value of a certain £1. Since different values may be compared, the 
uniqueness of a probability so defined requires a postulate that the 
value of the expectation, the proposition and the data remaining 
the same, is proportional to the value to be received if the proposition
is true. This is taken for granted by Bayes, and Ramsey makes an
equivalent statement (foot of p. 179).

† The Foundations of Mathematics, 1931, pp. 167-211. This essay, like that of Bayes,
was published after the author’s death, and suffers from a number of imperfections in
the verbal statement that he might have corrected.

The difficulty is that the value of
£1 to us depends on how much money we have already. This point was
brought out by Daniel Bernoulli in relation to what was called the
Petersburg Problem. Two players play according to the following rules.
A coin is to be thrown until a head is thrown. If it gives a head on
the first throw, A is to pay B £1; if the first head is on the second throw,
£2; on the third, £4; and so on. What is the fair sum for B to pay A
for his chances? The mathematical expectation in pounds is

(1/2).1 + (1/4).2 + (1/8).4 + (1/16).8 + ... = ∞.
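
The divergence of this series can be seen from its partial sums, each term contributing exactly one half. The function name `partial_expectation` is introduced for this sketch only.

```python
from fractions import Fraction

def partial_expectation(n_throws):
    """Sum of the first n_throws terms of the series: the first head falls on
    throw k with probability 1/2**k and then pays 2**(k - 1) pounds."""
    return sum(Fraction(1, 2**k) * 2**(k - 1) for k in range(1, n_throws + 1))

# The partial sums grow without limit: n terms give n/2 pounds.
partial_sums = [partial_expectation(n) for n in (1, 2, 10, 20)]
```

Since n terms always give n/2, no finite sum can equal the mathematical expectation.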

Thus on this analysis B should pay A an infinite sum. If we merely 
consider a large finite sum, such as £2^20, he will lose if there is a head
in any of the first 20 throws; he will gain considerably if the first head 
is on the 21st or a later throw. The question was, is it really worth 
anybody’s while to risk such a sum, most of which he is practically 
certain to lose, for an almost inappreciable chance of an enormous 
gain? Even eighteenth-century gamblers seem to have had doubts 
about it. Daniel Bernoulli’s solution was that the value of £2^20 is very
different according to the amount we have to start with. The value of
a loss of that sum to anybody that has just that amount is not equal 
and opposite to the value of a gain of the same sum. He suggested a 
law relating the value of a gain to the amount already possessed, which 
need not detain us;† but the important point is that he recognized that
expectations of benefit are not necessarily additive. What Laplace calls 
‘moral expectation’ is the value or pleasure to us of an event; its rela- 
tion to the monetary value in terms of mathematical expectation may 
be rather remote. Bayes wrote after Bernoulli, but before Laplace, 
but he does not mention Bernoulli. Nevertheless, the distinction does 
not dispose of the interest of the treatment in terms of expectation of 
benefit. Though we cannot regard the benefits of gains of the same 
kind as mutually irrelevant, on account of this psychological pheno- 
menon of satiety, there do seem to be many cases where benefits are 
mutually irrelevant. For instance, the pleasures to me of two dinners 
on consecutive nights seem to be nearly independent, though those of 
two dinners on the same night are definitely not. The pleasures of the 
\mexpected return of a loan, having a paper accepted for publication, 
a swim in the afternoon, and a theatre in the evening do seem 

† It is that the value of a gain dx, when we have x already, is proportional to dx/x; 
this is the rule associated in certain biological applications with the names of Weber 
and Fechner. 



§ 1.3 


FUNDAMENTAL NOTIONS 


33 


independent. If there are a sufficient number of such benefits (or if there 
could be in some possible world, since all we need is consistency), a 
scale of the values of benefits can be constructed, which will satisfy the 
commutative rule of addition, and then, by Bayes’s principles, one of 
probability in terms of them. The addition rule will then be a theorem. 
The product rule is treated by Bayes in the following way. We can 
write E(a, p | q) for the value of the expectation of receiving a if p is 
true, given q, and by definition of P(p | q), 

$$E(a, p \mid q) = a\,P(p \mid q).$$

The proportionality of E(a, p | q) to a, given p and q, is a postulate, as 
we have already stated. Consider the value of the expectation of 
getting a if p and q are both true, given r. This is aP(pq | r). But we 
may test p first and then q. If p turns out to be true, our expectation 
will be aP(q | pr), since p is now among our data; if untrue, we know 
that we shall receive nothing. Now return to the first stage. If p is 
true we shall receive an expectation, whose value is aP(q | pr), otherwise 
nothing. Hence our initial expectation is aP(q | pr)P(p | r); whence 

$$P(pq \mid r) = P(p \mid r)\,P(q \mid pr).$$

Ramsey’s presentation is much more elaborate, but depends on the 
same main ideas. The proof of the principle of inverse probability is 
simple. The difficulty about the separation of propositions into disjunctions 
of equally possible and exclusive alternatives is avoided by 
this treatment, but is replaced by difficulties concerning additive expectations. 
These are hardly practical ones in either case; no practical man 
will refuse to decide on a course of action merely because we are not 
quite sure which is the best way to lay the foundations of the theory. 
He assumes that the course of action that he actually chooses is the best; 
Bayes and Ramsey merely make the less drastic assumption that there 
is some course of action that is the best. In my method expectation 
would be defined in terms of value and probability; in theirs probability 
is defined in terms of values and expectations. The actual propositions 
are of course identical. 
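
The two-stage argument for the product rule can be checked on a small numerical case. The following is only an arithmetic sketch: the joint probabilities and the stake a are hypothetical figures, not taken from the text, and P(q | pr) is obtained simply by restricting attention to the cases in which p holds, so the check illustrates the argument rather than proving it independently.

```python
from fractions import Fraction as F

# Hypothetical probabilities on data r for the four combinations (p, q).
joint = {
    (True, True): F(3, 10),
    (True, False): F(1, 5),
    (False, True): F(1, 10),
    (False, False): F(2, 5),
}


def prob(event):
    """Probability on data r that event(p, q) holds."""
    return sum(w for (p, q), w in joint.items() if event(p, q))


a = F(100)  # hypothetical value received if p and q are both true

P_p = prob(lambda p, q: p)            # P(p | r)
P_pq = prob(lambda p, q: p and q)     # P(pq | r)
P_q_given_p = P_pq / P_p              # P(q | pr): p is now among the data

# One stage: expectation of getting a if p and q are both true.
direct = a * P_pq

# Two stages: if p is true we hold an expectation worth a P(q | pr),
# otherwise we receive nothing.
second_stage = a * P_q_given_p
two_stage = second_stage * P_p + 0 * (1 - P_p)

print(direct, two_stage)
```

Exact fractions are used so that the two values can be compared without rounding error.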

1.4. At any stage of knowledge it is legitimate to ask about a given 
hypothesis that is accepted, ‘How do you know?’ The answer will 
usually rest on some observational data. If we ask further, ‘What did 
you think of the hypothesis before you had these data?’ we may be 
told of some less convincing data; but if we go far enough back we shall 
always reach a stage where the answer must be: ‘I thought the matter 



worth considering, but had no opinion about whether it was true.’ 
What was the probability at this stage ? We have the answer already. 
If there is no reason to believe one hypothesis rather than another, the 
probabilities are equal. In terms of our fundamental notions of the 
nature of inductive inference, to say that the probabilities are equal is 
a precise way of saying that we have no ground for choosing between the 
alternatives. All hypotheses that are sufficiently definitely stated to 
give any difference between the probabilities of their consequences 
will be compared with the data by the principle of inverse probability; 
but if we do not take the prior probabilities equal we are expressing 
confidence in one rather than another before the data are available, and 
this must be done only from definite reason. To take the prior probabili- 
ties different in the absence of observational reason for doing so would 
be an expression of sheer prejudice. The rule that we should then take 
them equal is not a statement of any belief about the actual composi- 
tion of the world, nor is it an inference from previous experience; it is 
merely the formal way of expressing ignorance. It is sometimes referred 
to as the Principle of Insufficient Reason (Laplace) or the equal dis- 
tribution of ignorance. Bayes, in his great memoir, repeatedly says 
that the principle is to be used only in cases where we have no ground 
whatever for choosing between the alternatives. It is not a new rule 
in the present theory because it is an immediate application of Conven- 
tion 1. Much confusion has arisen about it through misunderstanding 
and attempts to reinterpret it in terms of frequency definitions. My 
contention is that the frequency definitions themselves lead to no 
results of the kind that we need until the notion of reasonable degree 
of belief is reintroduced, and that since the whole purpose of these 
definitions is to avoid this notion they necessarily fail in their object. 
When reasonable degree of belief is taken as the fundamental notion 
the rule is immediate. We begin by making no assumption that one 
alternative is more likely than another and use our data to compare them. 

Suppose that one hypothesis is suggested by one person A, and 
another by a dozen B, C,...; does that make any difference? No; but 
it means that we have to attend to two questions instead of one. First, 
is p or q true? Secondly, is the difference between the suggestions due 
to some psychological difference between A and the rest? The mere 
voting is not evidence because it is quite possible for a large number 
of people to make the same mistake. The second question cannot be 
answered until we have answered the first, and the first must be con- 
sidered on its merits apart from the second. 





1.5. We are now in a position to consider whether we have fulfilled the 
conditions that we required at the outset. I think (1) is satisfied, though 
the history of both probability and deductive logic is a warning against 
over-confidence that an unstated axiom has not slipped in. 

2. Axiom 1 assumes consistency, but this assumption by itself does 
not guarantee that a given system is consistent. It makes it possible 
to derive theorems by equating probabilities found in different ways, 
and if in spite of all efforts probabilities found in different ways were 
different, the axiom would make it impossible to accept the situation 
as satisfactory. We must not expect too much in the nature of a general 
proof of consistency. There is a theorem due to Gödel that if any logical 
system that includes arithmetic contained a proof of its own consis- 
tency, it would also contain one of its own inconsistency; so apparently 
it would be fatal to a system if we could find a general proof of consis- 
tency within it. Proofs of the consistency of various logical schemes 
(including the system of Principia Mathematica and therefore the 
theory of functions of a real variable) do exist, but only by going out- 
side the frames of the schemes themselves. The proof amounts to 
finding a proposition that can be stated in the system but cannot be 
proved or disproved by using the rules of the system. Since the system 
of Principia contains a proposition that two contradictory propositions 
imply any proposition, the existence of an undemonstrable proposition 
implies that the primitive propositions in the system are consistent. 
But this argument itself cannot be expressed in Principia language! 
What we want is that the probability of a proposition on the same data 
shall always be the same; thus, if we are considering two alternative 
hypotheses q_1 and q_2, our previous information is H, and the new 
evidence consists of two batches of data p_1 and p_2, the assessments on 
data p_1 p_2 H should be the same whether we take p_1 or p_2 into account 
first or both at once. Now, by the principle of inverse probability, 

$$\frac{P(q_1 \mid p_1 H)}{P(q_1 \mid H)\,P(p_1 \mid q_1 H)} = \frac{P(q_2 \mid p_1 H)}{P(q_2 \mid H)\,P(p_1 \mid q_2 H)}.$$

Replacing H by p_1 H and p_1 by p_2 we shall obtain the result for the 
application of the additional data p_2, p_1 being now already given: 

$$\frac{P(q_1 \mid p_1 p_2 H)}{P(q_1 \mid p_1 H)\,P(p_2 \mid q_1 p_1 H)} = \frac{P(q_2 \mid p_1 p_2 H)}{P(q_2 \mid p_1 H)\,P(p_2 \mid q_2 p_1 H)}.$$

Multiplying, we have 

$$\frac{P(q_1 \mid p_1 p_2 H)}{P(q_1 \mid H)\,P(p_1 \mid q_1 H)\,P(p_2 \mid p_1 q_1 H)} = \frac{P(q_2 \mid p_1 p_2 H)}{P(q_2 \mid H)\,P(p_1 \mid q_2 H)\,P(p_2 \mid p_1 q_2 H)}.$$




But by Axiom 7, assuming that the product rule holds for likelihoods, 

$$P(p_1 \mid q_1 H)\,P(p_2 \mid p_1 q_1 H) = P(p_1 p_2 \mid q_1 H),$$

and therefore 

$$\frac{P(q_1 \mid p_1 p_2 H)}{P(q_1 \mid H)\,P(p_1 p_2 \mid q_1 H)} = \frac{P(q_2 \mid p_1 p_2 H)}{P(q_2 \mid H)\,P(p_1 p_2 \mid q_2 H)},$$

which is the result of applying the principle of inverse probability to 
take account of the data p_1 and p_2 simultaneously. By symmetry we 
should obtain the same result if we took account of p_2 first. Extension 
to any number of batches of new data is obviously possible, and the 
results will therefore be consistent provided that we always start with 
the same data and finish with the same, and that we take account of 
the new data as we proceed. Neglect of the last condition may lead 
to inconsistencies, but that is the result of not applying the principle 
correctly. In the proof we have assumed that the product rule holds 
for likelihoods. This has not been proved in general, but has invariably 
been assumed even by those who claim to reject the principle of inverse 
probability. What our theorem shows is that if the product rule holds 
for likelihoods the principle of inverse probability cannot lead to 
contradiction. 
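
The consistency result can be illustrated numerically. The sketch below assumes two hypotheses with hypothetical prior probabilities and a hypothetical table of joint likelihoods for two batches of data; none of the figures comes from the text. It applies the principle of inverse probability in three ways (p_1 then p_2, p_2 then p_1, and both at once) and compares the posterior probabilities.

```python
from fractions import Fraction as F

prior = {"q1": F(1, 2), "q2": F(1, 2)}   # P(q | H), hypothetical

# Hypothetical joint likelihoods P(p1 = a, p2 = b | q H), a and b being True or False.
joint_lik = {
    "q1": {(True, True): F(2, 5), (True, False): F(2, 5),
           (False, True): F(1, 10), (False, False): F(1, 10)},
    "q2": {(True, True): F(1, 10), (True, False): F(1, 5),
           (False, True): F(3, 10), (False, False): F(2, 5)},
}


def update(prior, likelihood):
    """Inverse probability: posterior proportional to prior times likelihood."""
    unnorm = {q: prior[q] * likelihood[q] for q in prior}
    total = sum(unnorm.values())
    return {q: w / total for q, w in unnorm.items()}


def marginal(q, index):
    """P(p1 | q H) for index 0, P(p2 | q H) for index 1, from the joint table."""
    return sum(w for outcome, w in joint_lik[q].items() if outcome[index])


both = {q: joint_lik[q][(True, True)] for q in prior}       # P(p1 p2 | q H)
lik_p1 = {q: marginal(q, 0) for q in prior}                 # P(p1 | q H)
lik_p2 = {q: marginal(q, 1) for q in prior}                 # P(p2 | q H)
lik_p2_after_p1 = {q: both[q] / lik_p1[q] for q in prior}   # P(p2 | p1 q H)
lik_p1_after_p2 = {q: both[q] / lik_p2[q] for q in prior}   # P(p1 | p2 q H)

p1_then_p2 = update(update(prior, lik_p1), lik_p2_after_p1)
p2_then_p1 = update(update(prior, lik_p2), lik_p1_after_p2)
simultaneous = update(prior, both)

print(p1_then_p2, p2_then_p1, simultaneous)
```

The second-stage likelihoods are taken on data that already include the first batch, which is the condition stated above for the results to agree.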

The consistency of the product rule can be treated more directly as 
follows. Let q_i, r_k be two sets of propositions each exclusive and 
exhaustive on p, and denote their disjunctions by Q, R. Then 

$$P(r_k \mid p) = P(Q r_k \mid p) = \sum_i P(q_i r_k \mid p).$$

Instead of Axiom 7 assume that 

$$\frac{P(q_i \mid r_k p)}{P(q_j \mid r_k p)} = \frac{P(q_i r_k \mid p)}{P(q_j r_k \mid p)},$$

and assume that probabilities on data p satisfy the axioms. Then for 
probabilities on data r_k p it is obvious that Axioms 1, 2, 5 are satisfied: 
Axioms 3, 4, 6, 7 are easily proved, beginning with Axiom 6. Hence 
if we weaken Axiom 1 to a statement that probabilities are comparable 
given one sufficiently wide datum p, we can consistently convert the 
product rule into a definition of probabilities on data including p. 

3. For any assessment of the prior probability the principle of inverse 
probability will give a unique posterior probability. This can be used 
as the prior probability in taking account of a further set of data, and 
the theory can therefore always take account of new information. The 
choice of the prior probability at the outset, that is, before taking into 
account any observational information at all, requires further con- 
sideration. We shall see that further principles are available as a guide. 





These principles sometimes indicate a unique choice, but in many 
problems some latitude is permissible, so far as we know at present. 
In such cases, and in a different world, the matter would be one for 
decision by the International Research Council. Meanwhile we need 
only remark that the choice in practice, within the range permitted, 
makes very little difference to the results. 

4. This is satisfied by definition. 

5. We have avoided contradicting rule 5 so far, but further applications 
of it will appear later. 

6. Our main postulates are the existence of unique reasonable degrees 
of belief, which can be put in a definite order; Axiom 4 for the consistency 
of probabilities of disjunctions; either the axiomatic extension of the 
product rule or the theory of expectation. It does not appear that these 
can be reduced in number, without making the theory incapable of 
covering the ground required. 

7. The simple cases mentioned on pp. 29-30 show how the principle of 
inverse probability does correspond to ordinary processes of learning, 
though we shall go into much more detail as we proceed. Differences 
between individual assessments that do not agree with the results of 
the theory will be part of the subject-matter of psychology. Their 
existence can be admitted without reducing the importance of a unique 
standard of reference. It has been said that the theory of probability 
could be accepted only if there was experimental evidence to support 
it; that psychology should invent methods of measuring actual degrees 
of belief and compare them with the theory. I should reply that without 
an impersonal method of analysing observations and drawing inferences 
from them we should not be in a position to interpret these observations 
either. The same considerations would apply to arithmetic. To quote 
P. E. B. Jourdain:† 

‘I sometimes feel inclined to apply the historical method to the multiplication 
table. I should make a statistical inquiry among school children, before their 
pristine wisdom had been biased by teachers. I should put down their answers 
as to what 6 times 9 amounts to, I should work out the average of their answers 
to six places of decimals, and should then decide that, at the present stage of 
human development, this average is the value of 6 times 9.’ 

I would add only that without the multiplication table we should not 
be able to say what the average is. Nobody says that wrong answers 
invalidate arithmetic, and accordingly we need not say that the fact 
that some inferences do not agree with the theory of probability 


† The Philosophy of Mr. B*rtr*nd R*ss*ll, 1918, p. 88. 





invalidates the theory. It is sufficiently clear that the theory does 
represent the main features of ordinary thought. The advantage of a 
formal statement is that it makes it easier to see in any particular case 
whether the ordinary rules are being followed. 

This distinction shows that theoretically a probability should always 
be worked out completely. We have again an illustration from pure 
mathematics. What is the 1,000th figure in the expansion of e? 
Nobody knows; but that does not say that the probability that it is 
a 6 is 0.1. By following the rules of pure mathematics we could determine 
it definitely, and the statement is either entailed by the rules or 
contradicted; in probability language, on the data of pure mathematics 
it is either a certainty or an impossibility.† Similarly, a guess is not 
a probability. Probability theory is more complicated than deductive 
logic, and even in pure mathematics we must often be content with 
approximations. Mathematical tables consist entirely of approxima- 
tions. Hence we must expect that our numerical estimates of proba- 
bilities in practice will usually be approximate. The theory is in fact 
the system of thought of an ideal man that entered the world knowing 
nothing, and always worked out his inferences completely, just as pure 
mathematics is part of the system of thought of an ideal man who 
always gets his arithmetic right.‡ But that is no reason why the actual 
man should not do his best to approximate to it. 


1.6. We can now indicate in general terms how an inductive inference 
can approach certainty, though it cannot reach it. If q is a hypothesis, 
H the previous information, and p_1 an experimental fact, we have by 
two applications of the product rule, using Convention 3, 

$$P(q \mid p_1 H) = \frac{P(q \mid H)\,P(p_1 \mid qH)}{P(p_1 \mid H)}, \tag{1}$$

since both are equal to P(p_1 q | H)/P(p_1 | H). If p_1 is a consequence of 
q, P(p_1 | qH) = 1; hence in this case 

$$P(q \mid p_1 H) = \frac{P(q \mid H)}{P(p_1 \mid H)}. \tag{2}$$


† It is unfortunate that pure mathematicians speak of, for instance, the probability 
distribution of prime numbers, meaning a smoothed density distribution. Systematic 
botanists and zoologists are far ahead of mathematicians and physicists in tidying up 
their language. 

‡ An expert computer does not trust his arithmetic without applying checks, which 
would give identities if the work is correct but would be expected to fail if there is 
a mistake. Thus induction is used to check the correctness of what is meant to be 
deduction. The possibility that two mistakes have cancelled is treated as so improbable 
that it can be ignored. 





If p_2, p_3, ... are further consequences of q, which are found to be true, we 
shall have in succession 

$$P(q \mid p_1 p_2 H) = \frac{P(q \mid H)}{P(p_1 \mid H)\,P(p_2 \mid p_1 H)},$$

$$P(q \mid p_1 p_2 \ldots p_n H) = \frac{P(q \mid H)}{P(p_1 \mid H)\,P(p_2 \mid p_1 H) \cdots P(p_n \mid p_1 \ldots p_{n-1} H)}. \tag{3}$$

Thus each verification divides the probability of the hypothesis by the 
probability of the verification, given the previous information. Thus, 
with a sufficient number of verifications, one of three things must 
happen: (1) The probability of q on the information available will 
exceed 1. (2) It is always 0. (3) P(p_n | p_1 p_2 ... p_{n-1} H) will tend to 1. 

(1) is impossible since the highest degree of probability is certainty. 

(2) means that q can never reach a finite probability, however often it 
is verified. But if we adopt (3), repeated verifications of consequences 
of a hypothesis will make it practically certain that further consequences 
of it will be verified. This accounts for the confidence that we actually 
have in inductive inferences. 
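
The behaviour described in (3) can be seen in a simple numerical sketch. It assumes, purely for illustration, that q entails every p_i, that q has a small but non-zero prior probability, and that if q is false each p_i has probability c independently of the others; the names prior_q and c and their values are hypothetical and are not part of the text.

```python
from fractions import Fraction as F

prior_q = F(1, 100)   # hypothetical P(q | H), small but not zero
c = F(1, 2)           # hypothetical probability of each p_i if q is false


def evidence(n):
    """P(p_1 ... p_n | H) when q entails every p_i."""
    return prior_q + (1 - prior_q) * c ** n


posterior = prior_q
predictive = []   # P(p_n | p_1 ... p_{n-1} H) for n = 1, 2, ...
posteriors = []   # P(q | p_1 ... p_n H) for n = 1, 2, ...
for n in range(1, 21):
    pred = evidence(n) / evidence(n - 1)
    posterior = posterior / pred   # each verification divides by its probability
    predictive.append(pred)
    posteriors.append(posterior)

for n in (1, 5, 10, 20):
    print(n, float(predictive[n - 1]), float(posteriors[n - 1]))
```

In this sketch the probability of the next verification and the probability of q both rise towards 1 without reaching it, which corresponds to case (3).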


This proposition also provides us with an answer to various logical 
difficulties connected with the fact that if p entails q, q does not necessarily 
entail p. p may be one of many alternatives that would also 
entail q. In the lowest terms, if q is the disjunction of a set of alternatives 
q_1, q_2, ..., q_m, then any member of this set entails q, but q does not 
entail any particular member. Now in science one of our troubles is 
that the alternatives available for consideration are not always an 
exhaustive set. An unconsidered one may escape attention for centuries. 
The last proposition shows that this is of minor importance. It says 
that if p_1, ..., p_n are successive verifications of a hypothesis q, 

$$P(p_n \mid p_1 p_2 \ldots p_{n-1} H)$$

will approach certainty; it does not involve q and therefore holds 
whether q is true or not. The unconsidered hypothesis, if it had been 
thought of, would either (1) have led to the consequences p_1, p_2, ... or (2) 
to different consequences at some stage. In the latter case the data 
would have been enough to dispose of it, and the fact that it was not 
thought of has done no harm. In the former case the considered and 
the unconsidered alternatives would have the same consequences, and 
will presumably continue to have the same consequences. The un- 
considered alternative becomes important only when it is explicitly 
stated and a type of observation can be found where it would lead to 
different predictions from the old one. The rise into importance of the 





theory of general relativity is a case in point. Even though we now 
know that the systems of Euclid and Newton need modification, it was 
still legitimate to base inferences on them until we knew what particular 
modification was needed. The theory of probability makes it possible 
to respect the great men on whose shoulders we stand. 

The possibility of this procedure rests, of course, on the fact that 
there are cases where a large number of observations have been found 
to agree with predictions made by a law. The interest of an estimate of 
the probability of a law, given certain data, is not great unless those 
actually are our data. Indeed, a statement of it might lead to highly 
uncomplimentary remarks. It is not necessary that the predictions 
shall be exact. In the case of uniformly accelerated motion mentioned 
near the beginning, if the law is stated in the form that at any instant 
the observed value will lie within ±ε of the value given by the law, where ε is small 
compared with the whole range of variation of the observed quantity, it will still be a legitimate 
inference after many verifications that the law will hold in future 
instances within this margin of uncertainty. This takes us a further 
step towards understanding the nature of the acceptance of a simple 
law in spite of the fact that in the crude form given in applied mathe- 
matics it does not exactly agree with the observations. 

1.61. If we lump together all hypotheses that give indistinguishable 
consequences, their total probability will tend to 1 with sufficient 
verification. For if we have a set of hypotheses q_1, q_2, ..., all asserting 
that a quantity x will lie in a range ±ε, we may denote their disjunction 
by q, which will assert the same. Suppose that ~q would permit the 
quantity to lie in a range ±E, where E is much greater than ε. Suppose 
further that x is measured and found to be in the range indicated by q. 
Then if p denotes this proposition, P(p | qh) = 1, and P(p | ~qh) is of 
order ε/E. Hence 

$$\frac{P(q \mid ph)}{P(\sim q \mid ph)} = O\!\left(\frac{E}{\varepsilon}\right)\frac{P(q \mid h)}{P(\sim q \mid h)}.$$

Thus if E/ε is large and q is a serious possibility, a single verification 
may send its probability nearly up to 1. It is an advantage to consider 
together in this way all hypotheses that would give similar inferences 
and treat their disjunction as one hypothesis. The data give no informa- 
tion to discriminate between them so long as the data are consequences 
of all; the posterior probabilities remain in the ratios of the prior 
probabilities. With this rule, therefore, we can with a few verifications 
exclude from serious consideration any vaguely stated hypotheses that 
would require the observed results to be remarkable coincidences; while 





unforeseen alternatives whose consequences would agree with those 
given by hypotheses already included in q, within the range of verifica- 
tion at any stage, will give no trouble. By the time when any of them 
is stated explicitly, all hypotheses not implying values of x within the 
ranges actually found will have negligible probabilities anyhow, and all 
that we shall need to do is to separate the disjunction q as occasion 
arises. It is therefore desirable as far as possible to state hypotheses 
in such a form that those with indistinguishable consequences can be 
treated together; this will avoid mere mathematical complications 
relating to possibilities that we have no means of testing. 
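
The effect of a single verification on the odds can be sketched numerically. The text states only that P(p | ~qh) is of order ε/E; the sketch below makes the stronger, purely illustrative assumption that it equals ε/E exactly, and the prior probability and the ratio E/ε are hypothetical figures.

```python
def posterior_odds(prior_odds, eps, E):
    """Odds on q against ~q after one verification within +/- eps.

    Assumes P(p | q h) = 1 and, as a simplification, P(p | ~q h) = eps / E exactly.
    """
    return prior_odds * (E / eps)


def odds_to_probability(odds):
    """Convert odds on q into the probability of q."""
    return odds / (1.0 + odds)


prior_probability = 0.1   # hypothetical P(q | h): q is a serious possibility
prior_odds = prior_probability / (1.0 - prior_probability)
after_one = odds_to_probability(posterior_odds(prior_odds, eps=1.0, E=1000.0))
print(round(after_one, 4))
```

With E/ε = 1000 the single verification carries a prior probability of 0.1 to a posterior probability above 0.99 in this sketch.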

1.7. Theorem 11. If q_1, q_2, ..., q_n are a set of exclusive alternatives on 
data r, and if 

$$P(p \mid q_1 r) = P(p \mid q_2 r) = \ldots = P(p \mid q_n r),$$

then each = P(p | q_1 ∨ q_2 ... ∨ q_n · r). 

For if we denote the disjunction q_1 ∨ q_2 ... ∨ q_n by q, we have 

$$P(pq \mid r) = P(pq_1 \mid r) + P(pq_2 \mid r) + \ldots \tag{1}$$

since these alternatives are mutually exclusive; and this 

$$= P(p \mid q_1 r)\,P(q_1 \mid r) + \ldots. \tag{2}$$

The first factors are all equal, and the sum of the second factors is 
P(q | r). Hence 

$$P(pq \mid r) = P(p \mid q_1 r)\,P(q \mid r). \tag{3}$$

But 

$$P(pq \mid r) = P(p \mid qr)\,P(q \mid r), \tag{4}$$

which gives the theorem on comparison with (3). 

This leads to the principle that we may call the suppression of an 
irrelevant premiss. If q_1 ∨ q_2 ... is entailed by r, 

$$P(p \mid qr) = P(pq \mid r) = P(p \mid r),$$

since P(q | pr) = 1; and then each of the expressions P(p | q_i r) is equal 
to P(p | r). In words, if the probability of a proposition is the same 
for all the alternative data consistent with one fixed datum, then the 
probability on the fixed datum alone has the same value. 
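
Theorem 11 can be verified on a small case. In the sketch below the probabilities P(q_i | r) and the common value of P(p | q_i r) are hypothetical figures chosen for illustration; the alternatives are exclusive but deliberately not exhaustive, so that the conditioning on the disjunction q is not trivial.

```python
from fractions import Fraction as F

# Hypothetical P(q_i | r) for three exclusive alternatives (their sum is below 1).
P_q = [F(1, 5), F(3, 10), F(1, 10)]
common = F(2, 3)   # hypothetical shared value of P(p | q_i r)

P_pq_i = [common * w for w in P_q]      # P(p q_i | r), by the product rule
P_q_total = sum(P_q)                    # P(q | r), q being the disjunction
P_p_given_q = sum(P_pq_i) / P_q_total   # P(p | q r), from (1) and (4)

print(P_p_given_q)
```

The value obtained for P(p | qr) is the common value, whatever the separate P(q_i | r) may be.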

The interest of this theorem is primarily in relation to what are called 
‘chances’, in a technical sense given by N. R. Campbell and M. S. 
Bartlett. We have seen that probabilities of propositions in general 
depend on the data. But cases can be stated, and whether they exist 
or not must be considered, where the probability is the same over a 
wide range of data; in such a case we may speak of the information 
not common to all these data as irrelevant to the probability of the 





proposition. Thus above we can say that the propositions q_1, ..., q_n are 
irrelevant to p, given r. Further, 

$$P(p q_i \mid r) = P(q_i \mid r)\,P(p \mid q_i r) = P(q_i \mid r)\,P(p \mid r),$$

so that the product formula in such a case is legitimately replaced by 
the form (2) on p. 27. I shall therefore define a chance as follows: If 
q_1, q_2, ..., q_n are a set of alternatives, mutually exclusive and exhaustive on 
data r, and if the probabilities of p given any of them and r are the same, 
each of these probabilities is called the chance of p on data r. It is equal† 
to P(p | r). 

In any case where r includes the specification of all the parameters 
in a law, and the results of previous trials are irrelevant to the result 
of a new trial, the probability of a given result at that trial is the chance 
on data r. For the information available just before that trial is made 
is composed of r and the results of all previous trials. If we consider 
the aggregate of all the results that might have been obtained in pre- 
vious trials, they constitute a set of alternatives such that one of them 
must occur on data r, and are exclusive and exhaustive. Given then 
that the probability of an event at the next trial is the same whatever 
the results of previous trials, it must be equal to the chance on data r. 
It follows that the joint probability on data r of the results of several 
trials is the product of their separate chances on data r. This can easily 
be proved directly. For if p_1, p_2, ..., p_m are the results in order, we have 
by successive applications of the product formula 

$$P(p_1 p_2 \ldots p_m \mid r) = P(p_1 \mid r)\,P(p_2 \mid p_1 r)\,P(p_3 \mid p_1 p_2 r) \cdots P(p_m \mid p_1 \ldots p_{m-1} r),$$

and by the condition of irrelevance this is equal to 

$$P(p_1 \mid r)\,P(p_2 \mid r)\,P(p_3 \mid r) \cdots P(p_m \mid r).$$

This is usually taken for granted, but it is just as well to have it proved. 

When the probabilities, given the law, are chances, they satisfy the 
product rule automatically. Hence our proof of the consistency of the 
principle of inverse probability is complete in all cases where the likeli- 
hoods are derived from chances. This covers nearly all the applications 
in this book. 

Theorem 12. If p_1, p_2, ..., p_m and q_1, q_2, ..., q_n are two sets of alternatives, 
each exclusive and exhaustive on data r, and if 

$$P(p_s q_t \mid r) = f(p_s)\,g(q_t)$$

† Bayes and Laplace use both words ‘probability’ and ‘chance’, but so far as I know 
do not specify any distinction between them. There are, however, passages in their 
writings that suggest that they use the words with their modern senses interchanged. 




for all values of s and t, where f(p_s) depends only on p_s and r, and g(q_t) 
only on q_t and r, then 

$$P(p_s \mid r) \propto f(p_s); \qquad P(q_t \mid r) \propto g(q_t).$$

For if we denote the disjunctions of the p_s and q_t by p and q, we have 

$$P(p_s q \mid r) = \sum_t P(p_s q_t \mid r) = f(p_s) \sum_t g(q_t), \tag{1}$$

which is proportional to f(p_s). But 

$$P(p_s q \mid r) = P(p_s \mid r)\,P(q \mid p_s r) \tag{2}$$

and the last factor is 1 since q is entailed by r. Hence 

$$P(p_s \mid r) \propto f(p_s). \tag{3}$$

Similarly, 

$$P(q_t \mid r) \propto g(q_t). \tag{4}$$

We notice that 

$$P(pq \mid r) = \sum_s \sum_t f(p_s)\,g(q_t) = \sum_s f(p_s) \sum_t g(q_t) \tag{5}$$

and is equal to 1 since p and q are both entailed by r. It is possible 
to multiply f(p_s) and g(q_t) by factors such that both sums will be equal 
to 1; these factors will be reciprocals; and if this is done, since p and q 
separately are entailed by r, we shall have 

$$P(p_s \mid r) = f(p_s); \qquad P(q_t \mid r) = g(q_t).$$

Also 

$$P(p_s \mid q_t r) = P(p_s q_t \mid r)/P(q_t \mid r) = f(p_s) \tag{6}$$

and q_t is irrelevant to p_s. 

This theorem is useful in cases where a joint probability distribution 
breaks up into factors. 
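
A small numerical case shows how Theorem 12 works when the factors are given without normalization. The values of f and g below are hypothetical; they are chosen only so that the joint probabilities add up to 1, as they must when p and q are both entailed by r.

```python
from fractions import Fraction as F

# Hypothetical unnormalized factors: f depends only on p_s, g only on q_t.
f = [F(2), F(6), F(4)]
g = [F(1, 24), F(1, 24)]

# Joint probabilities P(p_s q_t | r) = f(p_s) g(q_t), one row for each p_s.
joint = [[fs * gt for gt in g] for fs in f]
total = sum(sum(row) for row in joint)   # must be 1, as in (5)

marg_p = [sum(row) for row in joint]                               # P(p_s | r)
marg_q = [sum(row[t] for row in joint) for t in range(len(g))]     # P(q_t | r)

# Rescale by reciprocal factors so that each set of factors sums to 1.
k = sum(f)
f_scaled = [fs / k for fs in f]
g_scaled = [gt * k for gt in g]

# Conditional probabilities P(p_s | q_t r) for the first q_t, as in (6).
cond_p_given_q0 = [joint[s][0] / marg_q[0] for s in range(len(f))]

print(total, marg_p, marg_q)
```

After the rescaling the factors coincide with the separate probabilities, and conditioning on any q_t leaves the probabilities of the p_s unchanged.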

1.8. Expectation of benefit is taken as a primitive idea in the Bayes-Ramsey 
theory. In the present one we can define the expectation of 
a function f(x) on data p by the equation 

$$E\{f(x) \mid p\} = \sum f(x)\,P(x \mid p)$$

taken over all values of x. For expectation of benefit, if benefits interfere, 
there is no great trouble. If x is, for instance, a monetary gain, 
we need only distinguish between x itself, the expectation of which 
will be Σ xP(x | p), and the benefit to us of x, which is not necessarily 
proportional to x. If it is f(x), the expectation of benefit will be 

$$\sum f(x)\,P(x \mid p).$$

The expectations of functions of a variable are often required for our 
purposes, though we shall not have much more to say about expecta- 
tion of benefit. But attention must be called at once to the fact that 
if the expectation of a variable is a, it does not mean that we expect 





the variable to be near a. Consider the following case. Suppose that 
we have two boxes A and B each containing n balls. We are to toss 
a coin; if it comes down heads we shall transfer all the balls from A to 
B; if tails, all from B to A. What is our present expectation of the 
number of balls in A after the process? There is a probability ½ that 
there will be 2n balls in A, and a probability ½ that there will be none. 
Hence the expectation is n, which is not a possible value at all. Incor- 
rect results have often been obtained by taking an expectation as a 
prediction of an actual value; this can be done only if it is also shown 
that the probabilities of different actual values are closely concentrated 
about the expectation. It may easily happen that they are concen- 
trated about two or more values, none of which is anywhere near the 
expectation. 
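
The case of the two boxes can be worked through directly from the definition of expectation. The sketch below is only an illustration: the helper name expectation and the choice n = 10 are assumptions made for the example.

```python
from fractions import Fraction as F


def expectation(f, dist):
    """E{f(x) | p}: the sum of f(x) P(x | p) over all values x."""
    return sum(f(x) * w for x, w in dist.items())


n = 10  # hypothetical number of balls in each box at the start

# Number of balls in A after the toss: heads empties A, tails gives it all 2n.
balls_in_A = {0: F(1, 2), 2 * n: F(1, 2)}

mean = expectation(lambda x: x, balls_in_A)
print(mean, mean in balls_in_A)
```

The expectation is n, yet n is not among the values that can actually occur, so the expectation cannot be read as a prediction here.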

1.9. It may be noticed that the words ‘idealism’ and ‘realism’ have 
not yet been used. I should perhaps explain that their use in everyday 
speech is different from the philosophical use. In everyday use, realism 
is thinking that other people are worse than they are; idealism is 
thinking that they are better than they are. The former is an expres- 
sion of praise, the latter of disparagement. It is recognized that nobody 
sees himself as others see him; it follows that everybody knows that 
everybody else is either a realist or an idealist. In philosophy, realism 
is the belief that there is an external world, which would still exist if 
we were not available to make observations, and that the function of 
scientific method is to find out properties of this world. Idealism is the 
belief that nothing exists but the mind of the observer or observers 
and that the external world is merely a mental construct, imagined to 
give us ourselves a convenient way of describing our experiences. The 
extreme form of idealism is solipsism, which, for any individual, asserts 
that only his mind and his sensations exist, other people’s minds also 
being inventions of his own. The methods developed in this book are 
consistent with some forms of both realism and idealism, but not with 
solipsism; they contribute nothing to the settlement of the main ques- 
tion of idealism versus realism, but they do lead to the rejection of 
various special cases of both. I am personally a realist (in the philo- 
sophical sense, of course) and shall speak mostly in the language of 
realism, which is also the language of most people; but if an idealist 
wishes to translate anything in this book into the language of idealism, 
I think he will be able to do it. To him I offer the bargain of the 
Unicorn with Alice: ‘If you’ll believe in me, I’ll believe in you.’ 





Solipsism is not, as far as I know, actively advocated by anybody 
(with the possible exception of the behaviourist psychologists). The 
great difficulty about it is that no two solipsists could agree. If A and 
B are solipsists, A thinks that he has invented B and vice versa. The 
relation between them is that between Alice and the Red King; but 
while Alice was willing to believe that she was imagining the King, she 
found the idea that the King was imagining her quite intolerable. 
Tweedledum and Tweedledee solved the problem by accepting the 
King’s solution and rejecting Alice’s; but every solipsist must have his 
own separate solipsism, which is flatly contradictory to every other’s. 
Nevertheless, solipsism does contain an important principle, recognized 
by Karl Pearson, that any person’s data consist of his own individual 
experiences and that his opinions are the result of his own individual 
thought in relation to those experiences. Any form of realism that 
denies this is simply false. A hypothesis does not exist till some one 
person has thought of it; an inference does not exist until one person 
has made it. We must and do, in fact, begin with the individual. 
But early in life he recognizes groups of sensations that habitually occur 
together, and in particular he notices resemblances between those 
groups that we, as adults, call observations of oneself and other people. 
When he learns to speak he has already made the observation that 
some sounds belonging to these groups are habitually associated with
other groups of visual or tactile sensations, and has inferred the rule 
that we should express by saying that particular things and actions are 
denoted by particular words; and when he himself uses language he 
has generalized the rule to say that it may be expected to hold for 
future events. 

Thus the use of language depends on the principle that generalization 
from experience is possible; and this is far from being the only such 
generalization made in infancy. But if we accept it in one case we 
have no ground for denying it in another. But a person also observes 
similarities of appearance and behaviour between himself and other 
people, and as he himself is associated with a conscious personality, it 
is a natural generalization to suppose that other people are too. Thus 
the departure from solipsism is made possible by admitting the pos- 
sibility of generalization. It is now possible for two people to under- 
stand and agree with each other simultaneously, which would be 
impossible for two solipsists. But we need not say that nothing is to 
be believed until everybody believes it. The situation is that one person 
makes an observation or an inference; this is an individual act. If he
reports it to anybody else, the second person must himself make an
individual act of acceptance or rejection. All that the first can say is 
that, from the observed similarities between himself and other people, 
he would expect the second to accept it. The facts that organized 
society is possible and that scientific disagreements tend to disappear 
when the participants exchange their data or when new data accumu- 
late are confirmation of this generalization. Regarded in this way the 
resemblance between individuals is a legitimate induction, and to take 
universal agreement as a primary requisite for belief is a superfluous 
postulate. 

Whether one is a realist or an idealist, the problem of inferring future 
sensations arises, and a theory of induction is needed. Both some 
realists and some idealists deny this, holding that in some way future 
sensations can be inferred deductively from some intuitive knowledge 
of the possible properties of the world or of sensations. If experience 
plays any part at all it is merely to fill in a few details. This must be 
rejected under rule 5. I shall use the adjective ‘naive’ for any theory, 
whether realist or idealist, that maintains that inferences beyond the 
original data are made with certainty, and ‘critical’ for one that admits
that they are not, but nevertheless have validity. Nobody that ever 
changes his mind through evidence or argument is a naive realist, 
though in some discussions it seems to be thought that there is no 
other kind of realism. It is perfectly possible to believe that we are 
finding out properties of the world without believing that anything we 
say is necessarily the last word on the matter. 

It should be remarked that some philosophers define ‘naif realism’ 
in some such terms as ‘the belief that the external world is something 
like our perception of it’, and argue in its favour. To quote a remark 
I once heard Russell make, ‘I wonder what it feels like to think that.’
The succession of two-dimensional impressions that we call visual
observations is nothing like the three-dimensional world of science, 
and I cannot think that such a hypothesis merits serious discussion. 
The trouble is that many philosophers are as far as most scientists 
from appreciating the long chain of inference that connects observation 
with the simplest notions of objects, and many of the problems that 
take up most attention are either solved at once or are seen to be 
insoluble when we analyse the process of induction itself. 



II 


DIRECT PROBABILITIES 

‘Having thus exposed the far-seeing Mandarin’s inner thoughts, would it be 
too excessive a labour to penetrate a little deeper into the rich mine of strategy 
and disclose a specific detail?’

Ernest Bramah, Kai Lung Unrolls his Mat 

2.0. We have seen that the principle of inverse probability can be 
stated in the form 

Posterior Probability ∝ Prior Probability × Likelihood,

where by the likelihood we understand the probability that the observa-
tions should have occurred, given the hypothesis and the previous
knowledge. The prior probability of the hypothesis has nothing to do
with the observations immediately under discussion, though it may 
depend on previous observations. Consequently the whole of the in- 
formation contained in the observations that is relevant to the posterior 
probabilities of different hypotheses is summed up in the values that 
they give to the likelihood. In addition, if the observations are to tell 
us much that we do not know already, the likelihood will have to vary 
much more between different hypotheses than the prior probability 
does. Special attention is therefore needed to the discussion of the 
probabilities of sets of observations given the hypotheses. 
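
The rule stated above can be sketched in a few lines of Python. This is only an illustration: the two hypotheses about a coin, their equal prior probabilities, and the function name `posterior` are assumptions introduced for the example and do not come from the text.

```python
from fractions import Fraction

def posterior(prior, likelihood):
    """Posterior probabilities proportional to prior * likelihood.

    Both arguments are dicts keyed by hypothesis; the hypotheses are
    taken to be exclusive and exhaustive (an assumption of this sketch).
    """
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: weight / total for h, weight in unnormalized.items()}

# Hypothetical example: a coin is either fair or two-headed,
# with equal prior probability.
prior = {"fair": Fraction(1, 2), "two-headed": Fraction(1, 2)}
# Likelihood of the observation "three heads in three throws".
likelihood = {"fair": Fraction(1, 2) ** 3, "two-headed": Fraction(1)}

post = posterior(prior, likelihood)
print(post)
```

Only the likelihoods differ between the hypotheses here, so the whole of the change from prior to posterior probability is carried by them.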

Another consideration is that we may be interested in the likelihood 
as such. There are many problems, such as those of games of chance, 
where the hypothesis is trusted to such an extent that the amount of 
observational material that would induce us to modify it would be far 
larger than will be available in any actual trial. But we may want to 
predict the result of such a game; or a bridge player may be interested 
in such a problem as whether, given that he and his partner have nine 
trumps between them, the remaining four are divided two and two. 
This is a pure matter of inference from the hypothesis to the probabili- 
ties of different events. Such problems have already been treated at 
great length, and I shall have little to say about them here, beyond 
indicating their general position in the theory. 
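
The bridge question mentioned above is a pure direct probability and can be worked out by counting. The sketch below assumes that the 26 cards not held by the player and his partner are divided so that every set of 13 is equally likely to be in a named opponent's hand; the function name and its parameters are hypothetical.

```python
from math import comb

def trump_split_probability(k, outstanding=4, unseen=26, hand=13):
    """Chance that one named opponent holds exactly k of the outstanding trumps.

    Assumes every set of `hand` cards out of the `unseen` ones is an
    equally likely holding for that opponent.
    """
    return (comb(outstanding, k) * comb(unseen - outstanding, hand - k)
            / comb(unseen, hand))

# Probability that the four outstanding trumps are divided two and two.
p_two_two = trump_split_probability(2)
print(round(p_two_two, 4))
```

On these assumptions the even division has a chance of about 0.407, so it is less probable than not.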

In Chapter I we were concerned mainly with the general rules that a 
consistent theory of induction must follow. They say nothing about 
what laws actually connect observations; they do provide means of 
choosing between possible laws, in accordance with their probabilities
given the observations. The laws themselves must be suggested before
they can be considered in terms of the rules and the observations. The 
suggestion is always a matter of imagination or intuition, and no general 
rules can be given for it. We do not assert that any suggested hypo- 
thesis is right, or that it is wrong; it may appear that there are cases 
where only one is available, but any hypothesis specific enough to give 
inferences has at least one contradictory, in comparison with which it 
may be considered. The evaluation of the likelihood requires us to 
regard the hypotheses as considered propositions, not as asserted pro-
positions; we can give a definite value to P(p | q) irrespective of whether
q is true or not. This distinction is necessary, because we must be able 
to consider the consequences of false hypotheses before we can say that 
they are false.† We get no evidence for a hypothesis by merely working
out its consequences and showing that they agree with some observa- 
tions, because it may happen that a wide range of other hypotheses 
would agree with those observations equally well. To get evidence for 
it we must also examine its various contradictories and show that they 
do not fit the observations. This elementary principle is often over- 
looked in alleged scientific work, which proceeds by stating a hypo- 
thesis, quoting masses of results of observation that might be expected 
on that hypothesis and possibly on several contradictory ones, ignoring 
all that would not be expected on it, but might be expected on some 
alternative, and claiming that the observations support the hypothesis. 
Most of the current presentations of the theory of relativity (the essen- 
tials of which are supported by observation) are of this type; so are those 
of the theory of continental drift (the hypotheses of which are contra- 
dicted by every other check that has been applied). So long as alter- 
natives are not examined and compared with the whole of the relevant 
data, a hypothesis can never be more than a considered one. 

In general the probability of an empirical proposition is subject to 
some considered hypothesis, which usually involves a number of quanti- 
tative parameters. Besides this, the general principles of the theory 
and of pure mathematics will be part of the data. It is convenient to 
have a summary notation for the set of propositions accepted throughout 
an investigation; I shall use H to denote it. H will include the specifica-
tion of the conditions of an observation. θ will often be used to denote
the observational data. 

† This is the reason for rejecting the Principia definition of implication, which leads
to the proposition, ‘If q is false, then q implies p.’ Thus any observational result p could
be regarded as confirming a false hypothesis q. In terms of entailment the corresponding
proposition, ‘If q is false, q entails p’, does not hold irrespective of p.





2.1. Sampling. Suppose that we have a population, composed of
members of two types $\phi$ and $\sim\phi$ in known numbers. A sample of given
number is drawn in such a way that any set of that number in the
population is equally likely to be taken. What, on these data, is the
probability that the numbers of the two types will have a given pair
of values?

Let r and s be the numbers of types $\phi$ and $\sim\phi$ in the population,
l and m those in the sample. The number of possible samples, subject
to the conditions, is the number of ways of choosing l+m things from
r+s, which we denote by $\binom{r+s}{l+m}$. The number of them that will have
precisely l things of type $\phi$ and m of type $\sim\phi$ is $\binom{r}{l}\binom{s}{m}$. Now on data
H any two particular samples are exclusive alternatives and are equally
probable; and some sample of total number l+m must occur. Hence
the probability that any particular sample will occur is $1\big/\binom{r+s}{l+m}$; and
the probability that the actual numbers will be l and m is obtained,
by the addition rule, by multiplying this by the total number of samples
with these numbers. Hence

$$P(l, m \mid H) = \binom{r}{l}\binom{s}{m} \bigg/ \binom{r+s}{l+m}. \tag{1}$$

It is an easy algebraic exercise to verify that the sum of all these ex-
pressions for different values of l, l+m remaining the same, is 1.
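
The sampling formula and the statement that its values sum to 1 can be checked numerically. The following sketch uses exact rational arithmetic; the population numbers r = 7, s = 5 and the sample size 4 are arbitrary values chosen for illustration, and the function name is hypothetical.

```python
from fractions import Fraction
from math import comb

def simple_sampling(l, m, r, s):
    """P(l, m | H) for sampling without replacement, as in equation (1)."""
    return Fraction(comb(r, l) * comb(s, m), comb(r + s, l + m))

r, s, size = 7, 5, 4          # hypothetical population and sample size
probs = [simple_sampling(l, size - l, r, s) for l in range(size + 1)]
total = sum(probs)            # sum over l with l + m held fixed

for l, p in enumerate(probs):
    print(l, size - l, p)
print("total:", total)
```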

Explicit statement of the data H is desirable because it may be true 
in some cases that all samples are possible but not equally probable.
In such cases the application of the rule may lead to results that are 
seriously wrong. To obtain a genuine random sample involves indeed 
a difficult technique. Yule and Kendall give examples of the dangers 
of supposing that a sample taken without any particular thought is 
a random sample. They are all rather more complicated than this 
problem. But the following would illustrate the point. Suppose that 
we want to know the general opinion of British adults on a political 
question. The most thorough method would be a referendum to the entire 
electorate. But a newspaper may attempt to find it by means of a vote 
among its readers. These will include many regular subscribers, and 
also many casual purchasers. It is possible that on a given day any 
individual might obtain the paper — even if it was only because all the 
others were sold out. Thus all the conditions in H are satisfied, except 
that of randomness; because on the day when the voting-papers are 
issued there is not an equal chance of a regular subscriber and an occa- 
sional purchaser obtaining that particular number of the paper. The 
tendency of such a vote would therefore be to give an excess chance
of a sample containing a disproportionately high number of regular
subscribers, who would presumably be more in sympathy with the 
general policy of the paper than the bulk of the population. 

2.11. Another type of sampling, which is extensively discussed in
the literature, is known as sampling with replacement. In this case
every member, after being examined, is replaced before the next draw.
At each stage every member, whether previously examined or not, is
taken to be equally likely to be drawn at any particular draw. This is
not true in simple sampling, because a member already examined cannot
be drawn at the next draw. If r and s as before are the numbers of the
types in the population, the chance at any draw of a member of the
first type being drawn, given the results of all the previous draws, will
always be r/(r+s), and that of one of the second type s/(r+s). This
problem is a specimen of the cases where the probabilities reduce to
chances.

Many other actual cases are chances or approximate to them. Thus
the probabilities that a coin will throw a head, or a die a 6, appear to
be chances, as far as we can tell at present. This may not be strictly
true, however, since either, if thrown a sufficient number of times,
would in general wear unevenly, and the probability of a head or a
six on the next throw, given all previous throws, would depend partly
on the amount of this wear, which could be estimated by considering
the previous throws. Thus it would not be a chance. The existence of
chances in these cases would not assert that the chance of a head is 1/2
or that of a six 1/6; the latter indeed seems to be untrue, though it is
near enough for most practical purposes.

If the chance of an event of the first type (which we may now call
a success) is x, and that of one of the second, which we shall call a failure,
is 1 − x = y, then the joint probability that l+m trials will give just l
successes and m failures, in any prescribed order, is $x^l y^m$. But there
will be $(l+m)!/(l!\,m!)$ ways of assigning the l successes to possible
positions in the series, and these are all equally probable. Hence in this case

$$P(l, m \mid H) = \frac{(l+m)!}{l!\,m!}\, x^l y^m, \tag{2}$$

which is a typical term in the binomial expansion of $(x+y)^{l+m}$. Hence
this law is usually known as the binomial distribution. In the case of
sampling with replacement it becomes

$$P(l, m \mid H) = \frac{(l+m)!}{l!\,m!} \left(\frac{r}{r+s}\right)^{l} \left(\frac{s}{r+s}\right)^{m}. \tag{3}$$





It is easy to verify that with either type of sampling the most probable
value of l is within one unit of r(l+m)/(r+s), so that the ratio of the
types in the sample is approximately the ratio in the population sampled.
This may be expressed by saying that in the conditions of random
sampling or sampling with replacement the most probable sample is a
fair one. It can also be shown easily that if we consider in succession
larger and larger populations sampled, the size of the sample always
remaining the same, but r and s tending to infinity in such a way that r/s
tends to a fixed value x/y, the formula for simple sampling tends to the
binomial one. What this means is that if the population is sufficiently
large compared with the sample, the extraction of the sample makes
a negligible difference to the probability at the next trial, which can
therefore be regarded as a chance with sufficient accuracy.
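
The passage of the simple sampling formula into the binomial one can be illustrated numerically. The sketch below keeps the sample fixed at l = 3, m = 7 and enlarges the population with r/s held at 3/7; these numbers and the function names are assumptions made for the example.

```python
from math import comb

def simple_sampling(l, m, r, s):
    """Probability of l and m in a sample drawn without replacement."""
    return comb(r, l) * comb(s, m) / comb(r + s, l + m)

def binomial(l, m, x):
    """Probability of l successes and m failures when the chance is x."""
    return comb(l + m, l) * x**l * (1 - x)**m

l, m, x = 3, 7, 0.3            # fixed sample; x/y = 3/7
target = binomial(l, m, x)

errors = []
for scale in (10, 100, 1000, 10000):
    r, s = 3 * scale, 7 * scale   # population grows with r/s = 3/7
    errors.append(abs(simple_sampling(l, m, r, s) - target))

print(target)
print(errors)
```

The differences shrink roughly in proportion to the reciprocal of the population size, which is the sense in which the population may be called large compared with the sample.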

2.12. Consider now what happens to the binomial law when l and
m are large and x fixed. Let us put

$$\frac{1}{f(l)} = \frac{(l+m)!}{l!\,m!}\, x^l y^m, \tag{4}$$

$$l+m = n; \quad l = nx + n^{1/2}\alpha; \quad m = ny - n^{1/2}\alpha, \tag{5}$$

and suppose that $\alpha$ is not large. Then

$$\log f(l) = \log l! + \log m! - \log n! - l\log x - m\log y. \tag{6}$$

Now we have Stirling’s formula†

$$\log n! = (n+\tfrac{1}{2})\log n - n + \tfrac{1}{2}\log 2\pi + \frac{1}{12n} - \cdots. \tag{7}$$

Substituting and neglecting terms of order 1/l, 1/m, we have

$$\log f(l) = \tfrac{1}{2}\log\frac{2\pi lm}{n} + l\log\frac{l}{nx} + m\log\frac{m}{ny}. \tag{8}$$


† The closeness of Stirling’s approximation, even if 1/12n is neglected, is remarkable.
Thus for n = 1 and 2 it gives

1! = 0.9221; 2! = 1.9190;

while if the term in 1/12n is kept it gives

1! = 1.0022; 2! = 2.0006.

Considered as approximations on the hypothesis that 1 and 2 are large numbers they are
very creditable. The use of the logarithmic series may lead to larger errors.

Proofs of the formula and of other properties of the factorial function, not restricted to
integral argument, are given in H. and B. S. Jeffreys, Methods of Mathematical Physics,
Chapter 16.


Now substituting for l and m, and expanding the logarithms to order
$\alpha^2$ we have

$$\log f(l) = \tfrac{1}{2}\log(2\pi nxy) + \frac{\alpha^2}{2xy}, \tag{9}$$

$$\frac{1}{f(l)} = \frac{1}{(2\pi nxy)^{1/2}} \exp\left\{-\frac{1}{2}\,\frac{(l-nx)^2}{nxy}\right\}. \tag{10}$$

This form is due to De Moivre.† From inspection of the terms neglected
we see that this will be a good approximation if l and m are large and
$\alpha$ not large compared with … or …. Also if nxy is large the chance
varies little between consecutive values of l, and the sum over a range
of values may be closely replaced by an integral, which will be valid as
an approximation till l − nx is more than a few times $(nxy)^{1/2}$. But the
integrand falls off with l − nx so rapidly that the integral over the range
where (10) is valid is practically 1, and therefore includes nearly all the
chance. But the whole probability of all values of l is 1. It follows that
nearly the whole probability of values of l is concentrated in a range
such that (10) is a good approximation to (4).
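
The closeness of Stirling's formula quoted in the footnote, and of the normal form (10) to the exact binomial law (4), can both be examined numerically. The following is a sketch only; the values n = 400 and x = 0.3 and all function names are assumptions chosen for illustration.

```python
from math import comb, exp, log, pi, sqrt

def stirling_factorial(n, keep_correction=False):
    """Stirling's approximation to n!, optionally keeping the 1/(12n) term."""
    log_value = (n + 0.5) * log(n) - n + 0.5 * log(2 * pi)
    if keep_correction:
        log_value += 1 / (12 * n)
    return exp(log_value)

def binomial_exact(l, n, x):
    """Exact binomial probability of l successes in n trials, as in (4)."""
    return comb(n, l) * x**l * (1 - x)**(n - l)

def de_moivre(l, n, x):
    """Approximation (10): normal form with mean nx and variance nxy."""
    y = 1 - x
    return exp(-(l - n * x)**2 / (2 * n * x * y)) / sqrt(2 * pi * n * x * y)

# Stirling's formula for n = 1 and 2, without and with the 1/(12n) term.
for k in (1, 2):
    print(k, stirling_factorial(k), stirling_factorial(k, keep_correction=True))

# Exact binomial against the normal form near the maximum (nx = 120).
n, x = 400, 0.3
for l in (100, 110, 120, 130, 140):
    print(l, binomial_exact(l, n, x), de_moivre(l, n, x))
```

With these values the two columns agree to within a few per cent over the range printed, and much more closely at the maximum itself.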

It follows further that if we choose any two positive numbers $\beta$ and
$\gamma$, and consider the probability that l will lie between $n(x+\beta)$ and
$n(x-\gamma)$, it will be approximately

$$\left(\frac{n}{2\pi xy}\right)^{1/2} \int_{-\gamma}^{\beta} \exp\left(-\frac{nz^2}{2xy}\right) dz,$$

which, if $\beta$ and $\gamma$ remain fixed, will tend to 1 as n tends to infinity.
That is, the probability that (l − nx)/n will lie within any specified limits,
however close, provided that they are of opposite signs, will tend to
certainty.
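
The approach to certainty can be seen directly from the exact binomial law, without using the integral. In the sketch below the chance x = 0.5 and the limits β = γ = 0.05 are arbitrary illustrative values, and the function name is hypothetical.

```python
from math import comb

def prob_within(n, x, beta, gamma):
    """Exact binomial probability that n(x - gamma) < l < n(x + beta)."""
    lo, hi = n * (x - gamma), n * (x + beta)
    return sum(comb(n, l) * x**l * (1 - x)**(n - l)
               for l in range(n + 1) if lo < l < hi)

x, beta, gamma = 0.5, 0.05, 0.05   # hypothetical chance and limits
results = [prob_within(n, x, beta, gamma) for n in (10, 100, 1000)]
print(results)
```

The limits are held fixed while n increases, and the probability of the ratio lying between them rises steadily towards 1.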

† Miscellanea Analytica, 1733.

2.13. This theorem was given by James Bernoulli in the Ars Conje-
ctandi (1713). It is sometimes known as the law of averages or the law
of large numbers. It is an important theorem, though it has often
been misinterpreted. We must notice that it does not prove that the
ratio l/n will tend to limit x when n tends to infinity. It proves that,
subject to the probability at every trial remaining the same, however
many trials we make, and whatever the results of previous trials, we
may reasonably expect that l/n − x will lie within any specified range
about 0 for any particular value of n greater than some assignable one
depending on this range. The larger n is, the more closely will this
probability approach to certainty, tending to 1 in the limit. The
existence of a limit for l/n would require that there shall be a series of
positive numbers $\alpha_n$, depending on n and tending to 0 as $n \to \infty$, such
that, for all values of n greater than some specified $n_0$, l/n − x lies
between $\pm\alpha_n$. But it cannot be proved mathematically that such series
always exist when the sampling is random. Indeed we can produce
possible results of random sampling where they do not exist. Suppose
that x = 1/2. It is essential to the notion of randomness that the results
of previous trials are irrelevant to the next. Consequently we can never
say at any definite stage that a particular result is out of the question.
Thus if we enter 1 for each success and 0 for each failure such series as
the following could arise:

100110010100100111010 ...,

100100100100100100100 ..., 
000000000000000000000 ..., 

111111111111111111111 ..., 

10110000111111110000000000 .... 

The first series was obtained by tossing a coin. The others were
systematically designed; but it is impossible to say logically at any
stage that the conditions of the problem forbid the alternative chosen.
They are all possible results of random sampling consistent with a
chance 1/2. But the second would give limit 1/3, the third and fourth
limits 0 and 1; the fifth would give no limit at all, the ratio l/n oscil-
lating between 1/3 and 2/3. (The rule adopted for this is that the number
of zeros or units in each block is equal to the whole number of figures
before the beginning of the block.) An infinite number of series could
be chosen that would all be possible results of random selection,
assuming an infinite number of random selections possible at all, and
giving either a limit different from 1/2 or no limit.
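
The behaviour of the fifth series can be checked by constructing it from the rule given in parentheses. The sketch below builds the series to a finite length and follows the running ratio of successes; the function names and the length 2^16 are choices made for the illustration.

```python
def block_series(total_length):
    """Build the fifth series: each block is as long as all figures before it."""
    digits = [1]                 # the series begins with a single success
    next_digit = 0
    while len(digits) < total_length:
        digits.extend([next_digit] * len(digits))
        next_digit = 1 - next_digit
    return digits[:total_length]

def running_ratio(digits):
    """Ratio of successes to trials after each trial."""
    successes = 0
    ratios = []
    for n, d in enumerate(digits, start=1):
        successes += d
        ratios.append(successes / n)
    return ratios

series = block_series(2**16)
ratios = running_ratio(series)
tail = ratios[2**10:]            # ignore the first 1024 trials
print("".join(map(str, series[:26])))
print(min(tail), max(tail))
```

The least and greatest values of the ratio in the later part of the series stay close to 1/3 and 2/3 respectively, so that no limit is approached.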

It was proved by Wrinch and me,† and another version of the proof
is given by M. S. Bartlett,‡ that if we take a fixed $\alpha$ independent of n,
$n_0$ can always be chosen so that the probability that there will be no
deviation numerically greater than $\alpha$, for any n greater than $n_0$, is as
near 1 as we like. But since the required $n_0$ tends to infinity as $\alpha$ tends
to 0, we have the phenomenon of convergence with infinite slowness
that led to the introduction of the notion of uniform convergence. It
is necessary, to prove the convergence of the series, that $\alpha_n$ shall tend
to 0; it must not be independent of n, otherwise the ratio might oscillate
finitely for ever.


† Phil. Mag. 38, 1919, 718-19.

‡ Proc. Roy. Soc. A, 141, 1933, 520-1.





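Before considering this further we need a pair of bounds for the
incomplete factorial function. Let

$$I = \int_x^\infty u^n e^{-u}\,du, \tag{1}$$

where x is large. Then

$$I > x^n \int_x^\infty e^{-u}\,du = x^n e^{-x}. \tag{2}$$

Also, if u = x+v,

$$u/x < \exp(v/x),$$

$$I < x^n e^{-x} \int_0^\infty \exp\left\{-\left(1-\frac{n}{x}\right)v\right\} dv = \frac{x^n e^{-x}}{1-n/x}. \tag{3}$$

Hence, if x/n is large,

$$I = x^n e^{-x}\left\{1 + O\left(\frac{n}{x}\right)\right\}. \tag{4}$$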
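
The two bounds can be checked numerically when n is a positive integer, for which the integral has a closed form as a finite sum. The sketch below relies on that closed form; the values n = 3 and x = 60 and the function name are assumptions chosen for illustration.

```python
from math import exp, factorial

def incomplete_factorial(n, x):
    """I = integral from x to infinity of u**n * exp(-u) du, integer n >= 0.

    Uses the closed form n! * exp(-x) * sum_{k=0}^{n} x**k / k!.
    """
    return factorial(n) * exp(-x) * sum(x**k / factorial(k) for k in range(n + 1))

n, x = 3, 60.0                   # hypothetical values with x/n large
lower = x**n * exp(-x)           # bound (2)
upper = lower / (1 - n / x)      # bound (3)
value = incomplete_factorial(n, x)
print(lower, value, upper)
print(value / lower - 1, n / x)  # relative excess compared with n/x
```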


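Now let P(n) be the chance of a ratio in n trials outside the range
$x \pm \alpha$. This is asymptotically

$$P(n) = 2\left\{\frac{n}{2\pi x(1-x)}\right\}^{1/2} \int_\alpha^\infty \exp\left\{-\frac{n\alpha^2}{2x(1-x)}\right\} d\alpha
= \left\{\frac{2x(1-x)}{\pi n}\right\}^{1/2} \frac{1}{\alpha} \exp\left\{-\frac{n\alpha^2}{2x(1-x)}\right\} \left\{1 + O\left(\frac{1}{n\alpha^2}\right)\right\} \tag{5}$$

by putting $n\alpha^2/2x(1-x) = u$ and applying (4).
Now take $\alpha_n = n^{-1/4}$;

$$P(n) = \left\{\frac{2x(1-x)}{\pi}\right\}^{1/2} n^{-1/4} \exp\left\{-\frac{n^{1/2}}{2x(1-x)}\right\} \{1 + O(n^{-1/2})\}. \tag{6}$$

The total chance that there will be a deviation greater than $\alpha_n$, for some
n greater than $n_0$, is less than the sum of the chances for the separate n,
since the alternatives are not exclusive. Hence this chance

$$Q(n_0) < \sum_{n > n_0} P(n) \approx \int_{n_0}^\infty P(n)\,dn. \tag{7}$$

Put $n = u^2$; then

$$Q(n_0) < 2\left\{\frac{2x(1-x)}{\pi}\right\}^{1/2} \int_{\sqrt{n_0}}^\infty u^{1/2} \exp\left\{-\frac{u}{2x(1-x)}\right\} du
= \frac{2\{2x(1-x)\}^{3/2}}{\pi^{1/2}}\, n_0^{1/4} \exp\left\{-\frac{n_0^{1/2}}{2x(1-x)}\right\}, \tag{8}$$

with a correcting term small compared with the first for large $n_0$. Hence
$Q(n_0)$ does tend to zero as $n_0$ tends to infinity, and we have the result
that $n_0$ can be fixed so that the total chance of deviations greater than
$n^{-1/4}$ for all n greater than $n_0$ is as small as we please; and if all deviations
are less than $n^{-1/4}$ the series converges. Hence it may be expected, with
an arbitrarily close approach to certainty, that subject to the conditions
of random sampling the ratio in the series will tend to x as a limit.†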

This, however, is still a probability theorem and not a mathematically
proved one; the mathematical theorem, that the limit must exist in
any case, is false because exceptions that are possible in the conditions
of random sampling can be stated.

The situation is that the proposition that the ratio does not tend to
limit x has probability 0 in the conditions stated. This, however, does
not entail that it will tend to this limit. We have seen (1) that series
such that the ratio does not tend to limit x are possible in the conditions
of the problem, (2) that though a proposition impossible on the data
must have probability 0 on those data, the converse is not true; a
proposition can have probability 0 and yet be possible in much simpler
cases than this, if we maintain Axiom 5, that probabilities on given
data form a set of not higher ordinal type than the continuum. If a
magnitude, limited to a continuous set of positive values, is less than
any assignable positive quantity, then it is 0. But this is not a contra-
diction because the converse of Theorem 2 is false. We need only
distinguish between propositions logically contradicted by the data,
in which case the impossibility can be proved by the methods of deduc-
tive logic, and propositions possible on the data but whose probability
is zero, such as that a quantity with a uniform distribution of its prob-
ability between 0 and 1 is exactly 1/2.

† Another proof is given by F. P. Cantelli, Rend. d. circ. matem., Palermo, 41, 1916,
191-201; Rend. d. R. Acad. d. Lincei, 26, 1917, 39-46. See E. C. Fieller, J. R. Stat. Soc.
99, 1936, 717.

The result is not of much practical importance; we never have to
count an infinite series empirically given, and though we might like
to make inferences about such series we must remember the condition
required by Bernoulli’s theorem, that no number of trials, however
large, can possibly tell us anything about their immediate successor
that we did not know at the outset. It seems that in physical conditions
something analogous to the wear of a coin would always violate this
condition. Consequently it appears that the problem could never arise.
Further, there is a logical difficulty about whether the limit of a ratio
in a random series has any meaning at all. In the infinite series con-
sidered in mathematics a law connecting the terms is always given, and
the sum of any number of terms can be calculated by simply following
rules stated at the start. If no such law is given, which is the essence
of a random process, there is no means of calculation. The difficulty
is associated with what is called the Multiplicative Axiom; this asserts
that such a rule always exists, but it has not been proved from the
other axioms of mathematical logic, though it has recently been
proved by Gödel to be consistent with them. Littlewood† remarks,
‘Reflection makes the intuition of its truth doubtful, analysing it
into prejudices derived from the finite case, and short of intuition
there seems to be nothing in its favour.’ The physical difficulty may
arise in a finite number of trials, so that there is no objection to sup-
posing that it may arise in any case even if the Multiplicative Axiom
is true. In fact I should say that the notion of chance is never more than
a considered hypothesis that we are at full liberty to reject. Its useful-
ness is not that chances ever exist, but that it is sufficiently precisely
stated to lead to inferences definite enough to be tested, and when it is
found wrong we shall in the process find out how much it is wrong.

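† Elements of the Theory of Real Functions, 1926, p. 25.

2.14. We can use the actual formula 2.12 (10) to obtain an approxi-
mation to the formula for simple sampling when l, m, r − l, and s − m
are all large. Consider the expression

$$F = \frac{r!}{l!\,(r-l)!}\, x^l y^{r-l} \times \frac{s!}{m!\,(s-m)!}\, x^m y^{s-m}, \tag{1}$$

where x and y are two arbitrary numbers subject to x+y = 1. r, s,
and l+m are fixed. Choose x so that the maxima of the two expres-
sions multiplied are at the same value of l, and call this value $l_0$ and
the corresponding value of m, $m_0$. Then

$$l_0 = rx; \quad r - l_0 = ry; \quad m_0 = sx; \quad s - m_0 = sy; \tag{2}$$

whence

$$(r+s)x = l_0 + m_0 = l + m. \tag{3}$$

Then, by 2.12 (10),

$$F = (2\pi rxy)^{-1/2} \exp\left\{-\frac{(l-l_0)^2}{2rxy}\right\} \times (2\pi sxy)^{-1/2} \exp\left\{-\frac{(m-m_0)^2}{2sxy}\right\}. \tag{4}$$

Also

$$G = \frac{(r+s)!}{(l+m)!\,(r+s-l-m)!}\, x^{l+m} y^{r+s-l-m} = \{2\pi(r+s)xy\}^{-1/2}. \tag{5}$$

Hence by division

$$P(l, m \mid H) = \left\{\frac{r+s}{2\pi rsxy}\right\}^{1/2} \exp\left\{-\frac{(l-l_0)^2 (r+s)}{2rsxy}\right\}. \tag{6}$$

But

$$(r+s)^2 xy = (l+m)(r+s-l-m), \tag{7}$$

whence

$$P(l, m \mid H) = \left\{\frac{(r+s)^3}{2\pi rs(l+m)(r+s-l-m)}\right\}^{1/2} \exp\left\{-\frac{(l-l_0)^2 (r+s)^3}{2rs(l+m)(r+s-l-m)}\right\}, \tag{8}$$

where

$$l_0 = \frac{r(l+m)}{r+s}. \tag{9}$$

Comparing this with 2.12 (10) we see that it is of similar form, and the
same considerations about the treatment of the tail will apply. If r
and s are very large compared with l and m, we can write

$$r = (r+s)p; \quad s = (r+s)q, \tag{10}$$

p and q now corresponding to the x and y of the binomial law; and the
result approximates to

$$P(l, m \mid H) = \left\{\frac{1}{2\pi(l+m)pq}\right\}^{1/2} \exp\left\{-\frac{(l-l_0)^2}{2(l+m)pq}\right\}, \tag{11}$$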


which is equivalent to 2.12 (10). In this form we see that the probabilities
of different compositions of the sample depend only on the sample and
on the ratio of the type numbers in the population sampled; provided
that the population is large compared with the sample, further informa-
tion about its size is practically irrelevant. But in general, on account
of the factor (r+s)/(r+s−l−m) in the exponent, the probability will
be somewhat more closely concentrated about the maximum than for
the corresponding binomial. This represents the effect of the with-
drawal of the first parts of the sample on the probabilities of the later
parts, which will have a tendency to correct any departure from fairness
in the earlier ones.
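
The approximation 2.14 (8) can be compared with the exact formula for simple sampling. In the following sketch the population r = 600, s = 400 and the sample size 200 are arbitrary illustrative values, and the function names are hypothetical.

```python
from math import comb, exp, pi, sqrt

def simple_sampling(l, m, r, s):
    """Exact probability of l and m in a sample drawn without replacement."""
    return comb(r, l) * comb(s, m) / comb(r + s, l + m)

def normal_form(l, m, r, s):
    """Approximation 2.14 (8) to the simple sampling formula."""
    n, total = l + m, r + s
    l0 = r * n / total
    # k is (r+s)^3 / {r s (l+m)(r+s-l-m)}, the reciprocal of the variance.
    k = total**3 / (r * s * n * (total - n))
    return sqrt(k / (2 * pi)) * exp(-0.5 * k * (l - l0)**2)

r, s, n = 600, 400, 200          # hypothetical population and sample size
for l in (110, 115, 120, 125, 130):
    print(l, simple_sampling(l, n - l, r, s), normal_form(l, n - l, r, s))

# Variance implied by (8) against that of the corresponding binomial.
variance_sampling = r * s * n * (r + s - n) / (r + s)**3
variance_binomial = n * (r / (r + s)) * (s / (r + s))
print(variance_sampling, variance_binomial)
```

The smaller variance for simple sampling shows the closer concentration about the maximum that the factor (r+s)/(r+s−l−m) produces.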

2.15. Multiple sampling and the multinomial law. These are
straightforward extensions of the laws for simple sampling and the
binomial law. In the first case, the population consists of p different
types instead of two, the numbers being $r_1, r_2, \ldots, r_p$; the corresponding
numbers in the sample are $n_1, n_2, \ldots, n_p$, with a prescribed total. It is
supposed as before that all possible samples of the given total number
are equally probable. The result is

$$P(n_1, n_2, \ldots, n_p \mid H) = \binom{r_1}{n_1}\binom{r_2}{n_2}\cdots\binom{r_p}{n_p} \bigg/ \binom{\sum r}{\sum n}. \tag{1}$$

In the second case, the chances of the respective types occurring at
any trial are $x_1, x_2, \ldots, x_p$ (their total being 1) and the number of trials
$\sum n$ is prescribed. The result is

$$P(n_1, n_2, \ldots, n_p \mid H) = \frac{(\sum n)!}{n_1!\,n_2!\cdots n_p!}\, x_1^{n_1} x_2^{n_2} \cdots x_p^{n_p}. \tag{2}$$

It is easy to verify in (1) that the most probable set of values of the n’s
are nearly in the ratios of the r’s, and in (2) that the most probable
set are nearly in the ratios of the x’s. Consequently we may in both
cases speak of the expected or calculated values; if N is the prescribed
total number of the sample, the expected $n_1$ for multiple sampling will
be $r_1 N/\sum r$, and the expected $n_1$ for the multinomial will be $N x_1$. The
probability will, however, in both cases be spread over a range about the
most probable values, and we shall need to attend later to the question
of how great a departure from the most probable values, on the hypo-
thesis we are considering, can be tolerated before we can say that there
is evidence against the hypothesis.
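
Both laws of this section are easily evaluated exactly. The sketch below uses three types; the population numbers (5, 3, 2), the chances (1/2, 3/10, 1/5) and the function names are assumptions made for the example.

```python
from fractions import Fraction
from math import comb, factorial, prod

def multiple_sampling(ns, rs):
    """Equation (1): sampling without replacement from several types."""
    numerator = prod(comb(r, n) for r, n in zip(rs, ns))
    return Fraction(numerator, comb(sum(rs), sum(ns)))

def multinomial(ns, xs):
    """Equation (2): the same chances xs at every trial."""
    coefficient = factorial(sum(ns))
    for n in ns:
        coefficient //= factorial(n)      # exact at every step
    return coefficient * prod(x**n for x, n in zip(xs, ns))

rs = (5, 3, 2)                                           # hypothetical population
xs = (Fraction(1, 2), Fraction(3, 10), Fraction(1, 5))   # rs as chances
ns = (2, 1, 1)                                           # sample in the ratios of the r's

p_sampling = multiple_sampling(ns, rs)
p_multinomial = multinomial(ns, xs)
print(p_sampling, p_multinomial)
```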

2.16. The Poisson law.† We have seen that the use of Stirling’s
formula in the approximation used for the binomial law involves the
neglect of terms of order 1/l and 1/m, while the result shows that there
is a considerable probability of departures of l from nx of amounts of
order $(nxy)^{1/2}$. If then $(nxy)^{1/2} > nx$, the result shows that l = 0 is a
very probable value, and the approximation must fail. But if n is
large, this condition implies that x is small enough for nx to be less
than 1. Special attention is therefore needed to cases where n is large
but nx moderate. We take the binomial law in the form

$$P(l \mid H) = \frac{n!}{l!\,(n-l)!}\, x^l (1-x)^{n-l}. \tag{1}$$

Now

$$\log\{n!/(n-l)!\} = l\log n + O(l^2/n). \tag{2}$$

Also, since x is small,

$$(1-x)^{n-l} = e^{-nx} \tag{3}$$

nearly; whence, so long as $l^2/n$ and lx are small,

$$P(l \mid H) = \frac{(nx)^l}{l!}\, e^{-nx}. \tag{4}$$

The sum of this for all values of l is unity, the terms being $e^{-nx}$ times the
terms of the expansion of $e^{nx}$. The formula is the limit of the binomial
when n tends to infinity and x to 0, but nx to a definite value. If $nx^2$
is small but nx large, both approximations to the binomial are valid.
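
The limiting process can be followed numerically by holding nx fixed and increasing n. In this sketch the value nx = 2 and the function names are assumptions chosen for illustration.

```python
from math import comb, exp, factorial

def binomial(l, n, x):
    """Binomial law in the form (1)."""
    return comb(n, l) * x**l * (1 - x)**(n - l)

def poisson(l, mean):
    """Poisson law (4) with mean nx."""
    return mean**l * exp(-mean) / factorial(l)

mean = 2.0                       # hypothetical fixed value of nx
worst = []
for n in (20, 200, 2000):
    x = mean / n
    # Largest difference between the two laws over l = 0, 1, ..., 10.
    worst.append(max(abs(binomial(l, n, x) - poisson(l, mean))
                     for l in range(11)))
print(worst)
```

The greatest difference between the two laws falls roughly in proportion to x as n increases.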

† S. D. Poisson, Recherches sur la probabilité des jugements, 1837, pp. 205-7.

The condition for the Poisson law is that there shall be a small chance
of an event in any one trial, but there are so many trials that there is
an appreciable probability that the event will occur in some of them.
One of the best-known cases is the study of von Bortkiewicz on the
number of men killed by the kick of a horse in certain Prussian army
corps in twenty years. The unit being one army corps for one year, the
data for fourteen corps for twenty years gave the following summary.†

Number of deaths    Number of units    Expected
0                   144                139.0
1                    91                 97.3
2                    32                 34.1
3                    11                  8.0
4                     2                  1.4
5 and more            0                  0.2


The analysis here would be that the chance of any one man being killed 
by a horse in a year is small, but the number of men in an army corps 
is such that the chance that there will be one man killed in an entire 
corps is appreciable. The probabilities that there will be 0, 1, 2,... 
men killed in a corps in a year are therefore given by the Poisson rule; 
and then by the multinomial rule, in a sample of 280 units, we should 
expect the observed numbers to be in approximately the ratios of these 
probabilities. The column headed ‘expected’ gives the expectations 
on the hypothesis that $nx = 0.70$. They have been recalculated, the
calculated values as quoted having been derived from several Poisson 
laws superposed. 
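The recalculation of the expected column can be sketched as follows. This is an illustration under the stated hypothesis $nx = 0.70$, with the last class taken as "5 and more"; the variable and function names are assumptions of the sketch.

```python
import math

UNITS = 280                          # 14 corps for 20 years
NX = 0.70                            # value of nx adopted in the text
OBSERVED = [144, 91, 32, 11, 2, 0]   # last entry is the class "5 and more"

def poisson_expectations(units, r, classes):
    """Expected numbers of units in classes 0, 1, ..., with a final 'and more' class."""
    probs = [r**k * math.exp(-r) / math.factorial(k) for k in range(classes - 1)]
    probs.append(1.0 - sum(probs))   # remaining chance goes to the last class
    return [units * p for p in probs]

EXPECTED = poisson_expectations(UNITS, NX, len(OBSERVED))

# Mean number of deaths per unit in the data (the empty last class contributes nothing).
SAMPLE_MEAN = sum(k * n for k, n in enumerate(OBSERVED)) / sum(OBSERVED)

if __name__ == "__main__":
    for k, (obs, exp) in enumerate(zip(OBSERVED, EXPECTED)):
        print(k, obs, round(exp, 1))
    print("sample mean:", SAMPLE_MEAN)
```

The sample mean of the tabulated data comes out at 0.70, which is the value of $nx$ used for the expectations.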

Another instance is radioactive disintegration. The chance of a 
particular atom of a radioactive element breaking up in a given interval 
may be very small; but a specimen of the substance may contain 
something of the order of $10^{20}$ atoms, and the chance that some of them
may break up is appreciable. The following table, due to Rutherford 
and Geiger,‡ gives the observed and expected numbers of intervals of
1/8 minute when 0, 1, 2, ... α-particles were ejected by a specimen.

Number    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14
Obs.     57  203  383  525  532  408  273  139   45   27   10    4    0    1    1
Exp.     54  211  407  525  508  393  254  140   68   29   11    4    1    0    0
O-E      +3   -8  -24    0  +24  +15  +19   -1  -23   -2   -1    0   -1   +1   +1

$nx$ is taken as the total number of particles divided by the total number
of intervals = 10097/2608 = 3.87. It is clear that the Poisson law
agrees with the observed variation within about one-twentieth of its
range; a closer check will be given later.
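The totals and the expected row can be reproduced from the observed counts alone. The sketch below assumes the observed row reads 57, 203, 383, 525, 532, 408, 273, 139, 45, 27, 10, 4, 0, 1, 1 (which gives the totals 2608 and 10097 quoted in the text); the names are illustrative.

```python
import math

# Observed numbers of intervals with 0, 1, ..., 14 particles.
OBSERVED = [57, 203, 383, 525, 532, 408, 273, 139, 45, 27, 10, 4, 0, 1, 1]

intervals = sum(OBSERVED)
particles = sum(k * count for k, count in enumerate(OBSERVED))
mean_rate = particles / intervals            # plays the part of nx

def poisson_expected(k):
    """Expected number of intervals showing k particles on the Poisson law."""
    return intervals * mean_rate**k * math.exp(-mean_rate) / math.factorial(k)

expected = [poisson_expected(k) for k in range(len(OBSERVED))]
residuals = [obs - exp for obs, exp in zip(OBSERVED, expected)]

if __name__ == "__main__":
    print("intervals", intervals, "particles", particles, "nx", round(mean_rate, 2))
    for k, (obs, exp) in enumerate(zip(OBSERVED, expected)):
        print(k, obs, round(exp), round(obs - exp))
```

Small differences from the tabulated residuals may arise from rounding the expectations before subtracting.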

† von Bortkiewicz, Das Gesetz d. kleinen Zahlen, 1898. Quoted by Keynes, p. 402.

‡ Rutherford, H. Geiger, and H. Bateman, Phil. Mag. 20, 1910, 698-707.







The Aitken dust-counter provides an example from meteorology.†
The problem is to estimate the number of dust nuclei in the air. A 
known volume of air is admitted into a chamber containing moisture 
and filtered air, and is then made to expand. This causes condensation 
to take place on the nuclei. The drops in a small volume fall on to a 
stage and are counted. Here the large number is the number of nuclei 
in the chamber, the small chance is the chance that any particular one 
will be within the small volume at the moment of sampling. Scrase 
gives the following values. 


Number    0    1    2    3    4    5    6    7    8
Obs.     23   56   88   95   73   40   17    5    3
Exp.     25   65   88   82   61   38   21   10    4
O-E      -2   -9    0  +13  +12   +2   -4   -5   -1


The data are not homogeneous, the observations having been made on
twenty different days; $nx$ was estimated separately for each and the separate
expectations were calculated and added. It appears that the method
gives a fair representation of the observed counts, though there are signs
of a systematic departure. Scrase suggests that in some cases zero counts
may have been wrongly rejected under the impression that the instrument
was not working. This would lead to an overestimate of $nx$ on some days,
therefore to an overestimate of the expectations for large numbers, and
therefore to negative residuals at the right of the table. Mr. Diananda
points out that the observed counts agree quite well with $nx = 2.925$.
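As an illustration, the pooled mean of the observed counts can be computed directly. The sketch assumes the observed row reads 23, 56, 88, 95, 73, 40, 17, 5, 3; that the value 2.925 was obtained in this way is an inference from the arithmetic, not something stated in the text.

```python
import math

# Observed numbers of samples showing 0, 1, ..., 8 drops.
OBSERVED = [23, 56, 88, 95, 73, 40, 17, 5, 3]

total = sum(OBSERVED)
mean_count = sum(k * c for k, c in enumerate(OBSERVED)) / total

def single_law_expected(k, r=None):
    """Expectation for count k if one Poisson law with parameter r covered all days."""
    r = mean_count if r is None else r
    return total * r**k * math.exp(-r) / math.factorial(k)

if __name__ == "__main__":
    print("total", total, "pooled mean", mean_count)
    for k, obs in enumerate(OBSERVED):
        print(k, obs, round(single_law_expected(k), 1))
```

A single law fitted in this way treats the twenty days as homogeneous, which is a simpler hypothesis than the day-by-day expectations tabulated above.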

2.2. The normal law of error. Let us suppose that a quantity that
we are trying to measure is equal to $\lambda$, but that there are various possible
disturbances, $n$ in number, each of which in any particular case
has equal chances of producing alterations $\pm\varepsilon$ in the actual measure;
the sign of the contribution from each is independent of those of the
others. This is a case of the binomial law. If $l$ of the components in an
individual observation are positive and the remaining $n-l$ negative,
the measured value will be

$$x = \lambda + l\varepsilon - (n-l)\varepsilon = \lambda + (2l-n)\varepsilon. \tag{1}$$

The possible measured values will then differ from $\lambda - n\varepsilon$ by even
multiples of $\varepsilon$. We suppose $n$ large. Then the probabilities of different
values of $l$ are distributed according to the law obtained by putting
$x = y = \tfrac12$ in 2.12 (10), namely,

$$P(l \mid H) = \left(\frac{2}{\pi n}\right)^{1/2}\exp\left\{-\frac{2}{n}\left(l-\tfrac12 n\right)^2\right\}, \tag{2}$$

† John Aitken, Proc. Roy. Soc. Edin. 16, 1888, 136-72; F. J. Scrase, Q.J.R. Met. Soc.
61, 1935, 368-78.





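and the probability that $l$ will be equal to $l_1$ or $l_2$ ($l_2 > l_1$), or some intermediate
value will be

$$P(l_1 \le l \le l_2 \mid H) = \sum_{l=l_1}^{l_2}\left(\frac{2}{\pi n}\right)^{1/2}\exp\left\{-\frac{2}{n}\left(l-\tfrac12 n\right)^2\right\}. \tag{3}$$

But this is the probability that the measure $x$ will be in the range from
$\lambda + (2l_1-n)\varepsilon$ to $\lambda + (2l_2-n)\varepsilon$ inclusive. If, then, we consider a range
$x_1$ to $x_2$, long enough to include many possible values of $l$, we can
replace the sum by an integral, write

$$l - \tfrac12 n = (x-\lambda)/2\varepsilon, \tag{4}$$

and
$$P(x_1 \le x \le x_2 \mid H) = \sum\left(\frac{2}{\pi n}\right)^{1/2}\exp\left\{-\frac{(x-\lambda)^2}{2n\varepsilon^2}\right\}. \tag{5}$$

This range will contain $(x_2-x_1)/2\varepsilon + 1$ admissible values of $x$. Now
suppose that $x_2 - x_1$, which is much larger than $\varepsilon$, is also much less than
$\varepsilon\sqrt n$. The sum will then approximate to

$$\int_{x_1}^{x_2}\left(\frac{2}{\pi n}\right)^{1/2}\exp\left\{-\frac{(x-\lambda)^2}{2n\varepsilon^2}\right\}\frac{dx}{2\varepsilon}. \tag{6}$$

Now let $n$ be very large and $\varepsilon$ very small, in such a way that $\varepsilon\sqrt n$ is finite.
The possible values of $x$ will then become indefinitely closely packed,
and if we now consider a small range from $x_1$ to $x_1+\delta x$, the chance that
$x$ lies within it will approximate to

$$P(x_1 < x < x_1+\delta x \mid H) = \frac{1}{\varepsilon\sqrt{2\pi n}}\exp\left\{-\frac{(x_1-\lambda)^2}{2n\varepsilon^2}\right\}\delta x. \tag{7}$$

This is an instance of the normal law, which we can write in its general form

$$P(x_1 < x < x_1+dx \mid H) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{(x_1-\lambda)^2}{2\sigma^2}\right\}dx, \tag{8}$$

or, more briefly,

$$P(dx \mid H) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{(x-\lambda)^2}{2\sigma^2}\right\}dx,$$

in the sense that when $dx$ tends to zero the ratio of the two sides tends
to 1. In practice we are always concerned with finite ranges, so that
strictly we always require the integrals of these expressions over some
finite range, and the transition from $\delta x$ to $dx$ involves only a step that
we shall always undo before we make any use of the results.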
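The quality of the approximation used for the chances of $l$ can be examined numerically. The sketch below is illustrative only; the function names and the choice of a range of two standard deviations are assumptions. It compares the exact binomial chance with $(2/\pi n)^{1/2}\exp\{-(2/n)(l-\tfrac12 n)^2\}$.

```python
import math

def exact_chance(l, n):
    """Exact chance that l of n equally likely +/- disturbances are positive."""
    return math.comb(n, l) / 2**n

def approx_chance(l, n):
    """Approximation (2/(pi n))^(1/2) * exp{-(2/n) (l - n/2)^2}."""
    return math.sqrt(2.0 / (math.pi * n)) * math.exp(-2.0 * (l - n / 2.0)**2 / n)

def worst_relative_error(n, width=2.0):
    """Largest relative error for l within `width` standard deviations of n/2.

    The standard deviation of l is sqrt(n)/2.
    """
    half = int(width * math.sqrt(n) / 2.0)
    centre = n // 2
    return max(abs(approx_chance(l, n) / exact_chance(l, n) - 1.0)
               for l in range(centre - half, centre + half + 1))

if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, worst_relative_error(n))
```

The relative error in the central region should fall off as $n$ increases, in keeping with the supposition that $n$ is large.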

It will be noticed that whereas we started with three parameters $\lambda$,
$n$, and $\varepsilon$, in the result we are left with two, $\lambda$ and $\varepsilon\sqrt n$, the latter being
replaced by $\sigma$. This is similar to what happens in sampling, where the





size of the population sampled becomes irrelevant when it is large. The 
form of the normal law, in application to errors, seems to have been 
given first by Laplace in 1783, though it is usually attributed to Gauss.†
The law can also be written

$$P(x_1 < x < x_1+dx \mid H) = \frac{h}{\sqrt\pi}\exp\{-h^2(x_1-\lambda)^2\}\,dx, \tag{9}$$

where
$$2h^2\sigma^2 = 1. \tag{10}$$

$\sigma$ is usually called the standard error, but sometimes the mean square
error or simply the mean error. $h$ is called the precision constant. If
we introduce the error function

$$\operatorname{erf} x = \frac{2}{\sqrt\pi}\int_0^x e^{-t^2}\,dt, \tag{11}$$

the probability that $x$ will be less than $x_1$ is $\tfrac12\{1+\operatorname{erf} h(x_1-\lambda)\}$. Tables
of the probability that $x-\lambda$ will be less than given multiples of $\sigma$ are
given by Sheppard and by later writers. The error function, which has
other applications in heat conduction and diffusion, is tabulated by
Milne-Thomson and Comrie. In statistical applications (8) is more
convenient than (11), since $\sigma$ usually arises more directly than $h$. The
curve $y \propto \exp\{-(x-\lambda)^2/2\sigma^2\}$ has inflexions at $\lambda\pm\sigma$. There is a probability
0.683 that an observation will lie between $\lambda\pm\sigma$. There is a
probability $\tfrac12$ that it will lie between $\lambda\pm 0.6745\sigma$. In this sense $0.6745\sigma$
is often called the probable error, and is the uncertainty usually quoted
in astronomical and physical works. This practice would be better
abandoned. In applying any significance test or the $\chi^2$ or $t$ rules what
arises is $\sigma$, and if uncertainties are given in terms of the probable error,
the multiplication must first be undone, with unnecessary trouble and
some loss of accuracy due to accumulation of rounding-off errors.
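The two numerical constants quoted above follow from the error function. The sketch below uses the standard-library `math.erf`; the function names and the bisection tolerance are assumptions of the illustration.

```python
import math

def chance_within(k):
    """Chance that a normal observation lies within k standard errors of lambda.

    With 2 h^2 sigma^2 = 1 this is erf(h * k * sigma) = erf(k / sqrt(2)).
    """
    return math.erf(k / math.sqrt(2.0))

def probable_error_multiple(tol=1e-12):
    """Multiple k of sigma for which chance_within(k) = 1/2, found by bisection."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if chance_within(mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    print("chance within one standard error:", chance_within(1.0))
    print("probable error / standard error:", probable_error_multiple())
```

Converting a quoted probable error back to a standard error means dividing by the second constant, which is the extra step the text recommends avoiding.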

The conditions contemplated in the normal law of error have often a 
rough justification. In many cases we have adequate reason to suppose 
that the quantity we are trying to measure has a ‘true value’, though 
we must reserve a further discussion of what that can mean in relation 
to our general theory. But several minor disturbances may affect any 
individual measure, such as wandering of the observer’s attention, the 
fact that he must round off his measures to the nearest multiple or 
tenth of the scale interval, disturbance of the apparatus through vibration
of the ground or wind, and so on. These can often be regarded as
independent. They are not in general capable of producing only two

† Pearson, Biometrika, 13, 1920, 25.





equal and opposite values of the disturbance; most of them are capable 
of a continuous range of values, and in general there is not much reason 
to suppose that these are equally spread for all the disturbances. The 
general application of the above argument must therefore be mistrusted. 
It can be regarded only as an indication that there may be cases where 
the chance of error is distributed according to the normal law, which 
sums up the whole information with regard to the possible variation in 
two parameters $\lambda$ and $\sigma$. $\lambda$ is also often called the population mean and
$\sigma$ the population standard deviation. The latter term is rather cumbrous,
and if the word ‘population’ is omitted it is liable to be confused with 
the standard deviation of a given finite set of observations, which is not 
the same thing. 

Where we are dealing with a law of the form 

$$P(dx \mid H) = f\!\left(\frac{x-\lambda}{\sigma}\right)\frac{dx}{\sigma},$$

of which the normal law is an instance, we may speak of $\lambda$ as the
location parameter and $\sigma$ as the scale parameter, to use Fisher’s terms.
These correspond to epistemological needs better than ‘true value’ and 
‘standard error’ do. But the latter terms are convenient; we have only 
to remember that ‘true value’ is not to be understood in an absolute 
sense, but in the sense that any law relating measures, if it is to be of 
any use, must be clearly stated, in probability terms, and that a possible 
way of progress (apparently the only possible way) is to treat the 
variation as the resultant of a part that would be exactly predictable,
given exact statements of the values of certain parameters, and a 
random error. The law in its naive form would deal only with the 
former part. The parameters in this part may be called the true values 
of the parameters, and the observed values that they would lead to if 
the random part was neglected the true values. The actual observed 
values will differ somewhat. By the principle of inverse probability we 
shall be able then to proceed from the observations to estimates of the 
true values of the parameters, which, however, will not be exact deter- 
minations, but will have ranges of uncertainty corresponding to the fact 
that the individual random errors in the observations are not definitely 
known. 

In actual fact there are some cases where the normal law of error 
appears to represent the outstanding variation as well as we can tell. 
There are others where, though we find that it is probably incorrect 
when we study a sufficient number of observations, this number is 





large, of the order of 500, and the use of the normal law in such cases 
as if it was correct would not lead to serious mistakes. There are others 
where it is glaringly wrong, and the only proper treatment is to obtain 
a sufficient number of observations to give us some idea of what the 
corresponding distribution of chance can be. Meanwhile we shall consider
an important series of generalized laws of error.


2.3. The Pearson laws. If we write the normal law of error in the
form

$$P(dx \mid \lambda, \sigma, H) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{(x-\lambda)^2}{2\sigma^2}\right\}dx, \tag{1}$$

where we have now made the parameters $\lambda$ and $\sigma$ explicit (they were
formerly understood in $H$), we see that it is an instance of the general form

$$P(dx \mid H) = y\,dx, \tag{2}$$

where $y \ge 0$ and the integral of $y$ over all possible values must be 1.
In this case we find easily

$$\frac{1}{y}\frac{dy}{dx} = -\frac{x-\lambda}{\sigma^2}. \tag{3}$$

The law, therefore, has the properties that $dy/dx$ vanishes in the limit
when $y$ tends to 0, and at one intermediate value of $x$, namely, $\lambda$. If we
consider the generalized form

$$\frac{1}{y}\frac{dy}{dx} = -\frac{x-a}{b_0+b_1x+b_2x^2}, \tag{4}$$

the same will usually hold, but we have two more parameters and shall
be able to represent laws of a much wider range of form. They will have
one point where $y$ is stationary; if the range of $x$ is infinite $y$ and $dy/dx$
will tend to zero at the end or ends; if the range is limited in one or
both directions there will still be cases where this holds. The integral
of (4) can in general be written in the form

$$y = A(x-c_1)^{m_1}(c_2-x)^{m_2}, \tag{5}$$

where $A$ will be fixed by the condition that the integral of $y$ is 1, and
$c_1$ and $c_2$ are the zeros of the denominator in (4). There are three main
types of solution and a number of transitional and degenerate cases.

1. $c_1$ and $c_2$ imaginary. Then they must be conjugate complexes,
and for $y$ to be real $m_1$ and $m_2$ must also be conjugate complexes.
$y$ cannot vanish or become infinite for any real value of $x$, and the
admissible values of $x$ range from $-\infty$ to $+\infty$, with a maximum of $y$




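at some intermediate value. Forms with one maximum are designated
bell-shaped by Pearson. We may write these laws in the form

$$y = \frac{2^{2m-1}\,(m-1+iq)!\,(m-1-iq)!}{2\pi\,(2m-2)!}\,\beta^{2m-1}\{(x-\lambda)^2+\beta^2\}^{-m}\exp\left\{2q\tan^{-1}\frac{x-\lambda}{\beta}\right\}. \tag{6}$$

These are Pearson’s Type IV. In general they are asymmetrical or
skew, but if $q = 0$ they reduce to the symmetrical form

$$y = \frac{2^{2m-1}\,(m-1)!\,(m-1)!}{2\pi\,(2m-2)!}\,\beta^{2m-1}\{(x-\lambda)^2+\beta^2\}^{-m} \tag{7}$$

$$= \frac{(m-1)!}{\pi^{1/2}\,(m-\tfrac32)!}\,\beta^{2m-1}\{(x-\lambda)^2+\beta^2\}^{-m}, \tag{8}$$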

which is Pearson’s Type VII. In both cases $m$ must be greater than $\tfrac12$
for convergence. These laws resemble the normal law in having an
infinite range of $x$ in both directions, which is true of no other Pearson
type, but $y$ falls off less rapidly. With the normal law the expectation
of any power of $x$ is finite; with Type VII that of any even power
equal to $2m-1$ or more is infinite ($m$ need not be integral); with Type IV
expectations of odd powers $\ge 2m-1$ are also infinite. This is a useful
property in representing errors of measurement, since it is usually found,
when sufficient observations are available, that there are more outlying
large residuals than the normal law would suggest. The fact that these
laws, like the normal law, give a non-zero chance of an error greater than
any finite amount is an apparent drawback, since we might say that
however bad the observations are there is some limit to the error; but
to harmonize this belief with the observed distributions would require
us to go beyond the range of the Pearson types, which do give satisfactory
agreement within the ranges where observations exist.
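The normalizing factor of the symmetrical law and its slow decay can be checked numerically. The following sketch assumes the Type VII law in the form $y = \dfrac{(m-1)!}{\pi^{1/2}(m-\frac32)!}\,\beta^{2m-1}\{(x-\lambda)^2+\beta^2\}^{-m}$, with the factorials computed through the gamma function; the midpoint-rule integrator and all names are illustrative choices.

```python
import math

def type7_density(x, m, beta=1.0, lam=0.0):
    """Symmetrical (Type VII) law; requires m > 1/2 for convergence."""
    const = math.gamma(m) / (math.sqrt(math.pi) * math.gamma(m - 0.5))
    return const * beta**(2 * m - 1) * ((x - lam)**2 + beta**2)**(-m)

def normal_density(x, sigma=1.0, lam=0.0):
    """Normal law with location lam and scale sigma."""
    return math.exp(-(x - lam)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

def integrate(f, a, b, steps=200000):
    """Composite midpoint rule on [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

if __name__ == "__main__":
    total = integrate(lambda x: type7_density(x, m=2.0), -200.0, 200.0)
    print("integral of Type VII density (m = 2):", total)
    # Far from the centre the Type VII ordinate greatly exceeds the normal one.
    print(type7_density(6.0, m=2.0), normal_density(6.0))
```

The comparison of ordinates at a single point depends on the scales chosen for the two laws; it is meant only to show the difference in the rate of decay.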

If $c_1$ and $c_2$ are real ($c_2 > c_1$) we must distinguish three cases. (4) has
singularities at $c_1$ and $c_2$ and the solution is applicable only in ranges
that do not include a singularity. Hence we must consider separately
cases where the admissible values of $x$ are less than $c_1$, between $c_1$ and
$c_2$, or greater than $c_2$. The difference between the first and third can
be removed by merely reversing the direction of measurement.

2. Admissible values of $x$ between $c_1$ and $c_2$. We can take the law in
the form

$$y = \frac{(m_1+m_2+1)!}{m_1!\,m_2!\,(c_2-c_1)^{m_1+m_2+1}}\,(x-c_1)^{m_1}(c_2-x)^{m_2}, \tag{9}$$




which will be possible if both $m_1$ and $m_2$ are greater than $-1$. If both
are positive, the curve is bell-shaped. If $0 > m_1 > -1$, $y$ is infinite at
$c_1$. If at the same time $m_2$ is positive, $dy/dx$ is negative throughout the
range and the curve is called J-shaped. In this case $a$ does not lie
between $c_1$ and $c_2$, and is not an admissible value of $x$. If $m_1$ and $m_2$
are both negative, $y$ is infinite at both limits and $a$ lies between them.
The curve is then called U-shaped. These cases cover Pearson’s Type I.
It will be seen that the possibility of U-shaped and J-shaped curves
gives it greater generality than was originally attempted.

There are several special cases:

$m_1 = m_2$. The law is then symmetrical. This is Pearson’s Type II.
Further degenerations give

$m_1 = m_2 = 0$. This makes $y$ uniform between $c_1$ and $c_2$, and zero
outside that range. This is the rectangular distribution, not given
a number by Pearson.

$m_1 = m_2 = 1$. This, with a change of scale and origin, gives $y \propto 1 - x^2$,
the parabolic distribution.

$m_1 = 0$. This is a J-shaped curve with $y$ proportional to $(c_2-x)^{m_2}$ for
$x$ between $c_1$ and $c_2$. This is Pearson’s Type IX. It starts from a
finite ordinate at $c_1$.

$m_1 = -m_2$. $y$ will be proportional to

$$\left(\frac{x-c_1}{c_2-x}\right)^m \quad \text{with } -1 < m < 1.$$

This is Pearson’s Type XII. The curve is always J-shaped.

3. Admissible values of $x$ all $\ge c_2$. We can take the law in the form

$$y = \frac{(-m_1-1)!}{m_2!\,(-m_1-m_2-2)!}\,(c_2-c_1)^{-m_1-m_2-1}(x-c_1)^{m_1}(x-c_2)^{m_2}, \tag{10}$$

where for convergence $m_2 > -1$, $m_1+m_2 < -1$. These are the laws
of Type VI. If $m_2 > 0$ they are bell-shaped; if $m_2 < 0$, J-shaped. They
are never U-shaped. These laws will give the kind of distribution shown
by the times of arrival of a train; there is a concentration at values a
little greater than $c_2$, values less than $c_2$ do not occur, and there is a
long train of large values, which may rarely occur but are serious when
they do.

A particular case is

$m_2 = 0$. This makes $y$ proportional to $(x-c_1)^{m_1}$ for values of $x$ greater
than $c_2$; evidently $m_1 < -1$. This gives Pearson’s Types VIII and
XI, which are identical. It starts from a finite ordinate at $c_2$.




Types IV, I, and VI, to take them in what seems to me to be their 
natural order, are the only ones that involve the full number of adjust- 
able parameters, four. There are also three transitional cases between 
them. 

4. There will be a transition from Type I to Type VI expressed by
making $c_2$ in I tend to $+\infty$ or $c_1$ in VI to $-\infty$. In either case the limiting
form is

$$y = \frac{\alpha^{m+1}}{m!}\,(x-c)^m e^{-\alpha(x-c)} \quad (m > -1,\ \alpha > 0).$$

This is Type III. It resembles Type VI in appearance but is more
closely concentrated to small departures from $c$. A particular case is

$m = 0$; this is Type X, an exponential law, which can also be regarded
as the transition between Types VIII and IX.

5. The transition from Type VI to Type IV is the case of equal roots,
the roots of the denominator in (4) being equal, real, and finite. Then
we can write (4) in the form

$$\frac{1}{y}\frac{dy}{dx} = -\frac{\alpha}{x-c} + \frac{\beta}{(x-c)^2},$$

whence
$$y = A(x-c)^{-\alpha}\exp\left(-\frac{\beta}{x-c}\right).$$

This is Type V. To give convergence at $\infty$, $\alpha$ must be $> 1$; for convergence
at $c$, $\beta > 0$ for any $\alpha > 1$. It is always bell-shaped, since $y$
must vanish at $x = c$. Otherwise it resembles Type VI. It differs from
Type III in the interchange of the two types of convergence at the
extremes; indeed, the change of $(x-c)$ to $(x-c)^{-1}$ transforms one into
the other.

6. The transition from Type IV to Type I requires the roots to be
$\pm\infty$; then $b_1$ and $b_2$ both vanish and we are back to the normal law.

This analysis covers the range of the Pearson types, and is, I think, 
considerably shorter and more systematic than has been given pre- 
viously. My own experience with them has been rather small, though 
I have had to deal with Types II, III, VII, and VIII. For purposes of 
exposition I think it would be a great convenience if those who use 
them extensively could agree on a more systematic numbering in place 
of the present haphazard one, which places III, the transition between 
I and VI, between II, which is the symmetrical case of I, and IV, which 
is a different main type from any; and VI, a main type, between V, a 
transitional case, and VII, a degenerate case of IV. I should suggest 
the following. 








Number     Pearson's number    Special cases    Pearson's    Suggested

Main types
1          IV                  q = 0            VII          1a
2          I                   m₁ = m₂          II           2a
                               m₁ = m₂ = 0      Rect.        2b
                               m₁ = m₂ = 1      Parab.       2c
                               m₁ = 0           IX           2d
                               m₁ = -m₂         XII          2e
3          VI                  m₂ = 0           VIII         3a

Transitions
2 to 3     III                 m = 0            X            23a
3 to 1     V
1 to 2     Normal


This covers the whole range with the exception of XI, which is a mere 
rewriting of VIII. I think that special numbers for the rectangular and 
parabolic laws are worth while as they are likely to be at least as important
as XII in practice, and the rectangular law has great theoretical
interest. Both, like the normal law, involve only a scale parameter and 
a location parameter. The main types involve two others. The rest 
involve three parameters in all. 

It may be remarked that Pearson distinguished Types I and VI 
according as the roots are real and of opposite sign or real and of like 
sign. This appears to make the type depend on the arbitrary position 
of the origin. The important point is whether the admissible values of 
$x$ lie between the roots or not. In fact Pearson does make his decision
according to the latter criterion. 

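2.4. The negative binomial law. Suppose that a distribution of
chance follows the Poisson law

$$P(l \mid r, H) = \frac{r^l}{l!}\,e^{-r}, \tag{1}$$

but that $r$ itself is unknown, having a distribution of chance given by
the Type III law

$$P(dr \mid H) = \frac{\beta^{\alpha+1}}{\alpha!}\,r^{\alpha}e^{-\beta r}\,dr \tag{2}$$

(where, since $\alpha$ may be fractional, we must understand $\alpha!$ to be defined
by $\alpha! = \int_0^\infty t^{\alpha}e^{-t}\,dt$). Then

$$P(l, dr \mid H) = \frac{\beta^{\alpha+1}}{l!\,\alpha!}\,r^{l+\alpha}e^{-(\beta+1)r}\,dr. \tag{3}$$

To get the total probability for any value of $l$, we must add for all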





possible values of $r$; which means in this case that we must integrate.
Then

$$P(l \mid H) = \int_0^\infty \frac{\beta^{\alpha+1}}{l!\,\alpha!}\,r^{l+\alpha}e^{-(\beta+1)r}\,dr = \frac{\beta^{\alpha+1}}{(\beta+1)^{l+\alpha+1}}\,\frac{(l+\alpha)!}{l!\,\alpha!}. \tag{4}$$

Apart from the factor $\{\beta/(\beta+1)\}^{\alpha+1}$ this is the coefficient of $t^l$ in the
expansion of $\left(1 - \dfrac{t}{\beta+1}\right)^{-\alpha-1}$. The sum over all values of $l$ is 1, as it
must be since the conditions stated are exhaustive. If we put

$$\frac{\beta}{\beta+1} = 1 - x,$$

we have

$$P(l \mid H) = \frac{(\alpha+1)(\alpha+2)\cdots(\alpha+l)}{l!}\,x^l(1-x)^{\alpha+1}, \tag{5}$$

which puts the negative binomial form more clearly in evidence. This
result is due to M. Greenwood and G. U. Yule.† The immediate application
was to problems of factory accidents. The conditions of the Poisson
law were satisfied in respect of the total chance of an accident in a
factory in a given period being the sum of a large number of small
chances, but it was not clear that these chances were the same for all
employees. The chance of a particular workman having an accident on
a particular day, for instance, would have to be regarded as the analogue
of $x$ in the derivation of the Poisson law, and the number of days in the
period considered as the analogue of $n$. Then for each individual the
chances of 0, 1, 2, ... accidents in the period would follow a Poisson law
(subject to the condition that having one accident does not stimulate
him to have another), and if the values of $r = nx$ for the different workmen
are distributed, as nearly as can be for a finite number, in a Type
III law, the negative binomial follows as the resultant for all workmen.
The following alternative development shows that the condition that 
the probabilities of accidents to the same workman must be independent 
is not strictly necessary. It can at any rate be replaced by other condi- 
tions. Suppose that the total number of events is recorded, but that in 
fact some of the events are composite, two or more being associated. 
These are each only one independent event, but will be counted as two
or more each in the totals. Let $r_1, r_2, \ldots$ be the appropriate values of $r$
for the simple, double, ... events in the interval considered. Each type

† J. R. Stat. Soc. 83, 1920, 255-79.





separately will satisfy the Poisson rule, and the chance that there will
be $m_1$ simple, $m_2$ double events, and so on, will be

$$P(m_1, m_2, \ldots \mid H) = \frac{r_1^{m_1}}{m_1!}\,\frac{r_2^{m_2}}{m_2!}\cdots\exp\{-(r_1+r_2+\ldots)\}. \tag{6}$$

The probability that the total number of events as counted will be $m$
is the sum of these expressions, subject to

$$m_1 + 2m_2 + 3m_3 + \ldots = m. \tag{7}$$

But this sum is the coefficient of $x^m$ in the expansion of

$$f(x) = \exp(r_1x + r_2x^2 + \ldots - r_1 - r_2 - \ldots). \tag{8}$$

Now in practice, if we have no record of the individual events, there
will not be much hope of determining the $r$’s separately. But if we
want to find a law that will take into account the extra complication
we must have at least one new parameter, though there may not be
much point in introducing more than one. Let us take the form:

$$r_s = r_1a^{s-1}/s. \tag{9}$$

Then

$$\log f(x) = r_1x\left(1 + \tfrac12 ax + \tfrac13 a^2x^2 + \ldots\right) - r_1\left(1 + \tfrac12 a + \ldots\right) \tag{10}$$

$$= (r_1/a)\{-\log(1-ax) + \log(1-a)\}, \tag{11}$$

and the coefficient of $x^m$ is

$$(1-a)^{r_1/a}\,\frac{r_1}{a}\left(\frac{r_1}{a}+1\right)\cdots\left(\frac{r_1}{a}+m-1\right)\frac{a^m}{m!}, \tag{12}$$

which again is a negative binomial law, with $r_1/a$ replacing the $\alpha+1$ of
Greenwood and Yule’s derivation.†

It is convenient to take the law in the form

$$P(m \mid r, n, H) = \left(\frac{n}{n+r}\right)^n\frac{n(n+1)\cdots(n+m-1)}{m!}\left(\frac{r}{n+r}\right)^m. \tag{13}$$

When $n \to \infty$ this tends to the Poisson law with parameter $r$. We shall
see later that it has other advantages. The series converges for all
positive $n$. The expectations of $m$ and $m(m-1)$ are $r$ and $(1+1/n)r^2$.
That of $(m-r)^2$ is $r + r^2/n$. When $n \to 0$, all the chances of non-zero $m$
tend to 0, while that of $m$ being zero tends to 1. In the latter case as
we approach the limit, keeping $r$ fixed, the chances of $m$ become more
and more widely spread to wide values, and the concentration at 0 is
needed to keep the total expectation equal to $r$. Thus the negative

† This derivation has already been given by R. Lüders, Biometrika, 26, 1934, 108-28.




binomial law, for small n, will resemble the distribution of the scores 
of a first-class cricket or billiards player, whose commonest score may 
be 0 though his average is about 60. On the Poisson law the commonest 
score and the average should approximately agree, and the chance of 
a score of 1 would be 60 times that of a score 0. 
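The stated expectations and the behaviour for small $n$ can be checked by direct summation. The sketch assumes the law (13) in the form $\left(\frac{n}{n+r}\right)^n \frac{n(n+1)\cdots(n+m-1)}{m!}\left(\frac{r}{n+r}\right)^m$; the function names, the truncation points, and the parameter values are illustrative.

```python
import math

def negative_binomial_table(r, n, m_max):
    """Chances of m = 0..m_max under form (13), built by the ratio of successive terms."""
    probs = [(n / (n + r))**n]
    for m in range(1, m_max + 1):
        probs.append(probs[-1] * (n + m - 1) / m * r / (n + r))
    return probs

def mean_and_spread(r, n, m_max=400):
    """Expectation of m and of (m - r)^2, summing the series up to m_max."""
    probs = negative_binomial_table(r, n, m_max)
    mean = sum(m * p for m, p in enumerate(probs))
    spread = sum((m - r)**2 * p for m, p in enumerate(probs))
    return mean, spread

if __name__ == "__main__":
    print(mean_and_spread(r=3.0, n=2.0))          # compare with r and r + r^2/n
    # Small n with a large average: the commonest value is 0.
    probs = negative_binomial_table(r=60.0, n=0.5, m_max=50)
    print(probs[:3])
```

With $r = 60$ and a small $n$ the largest single chance falls at $m = 0$, as in the illustration of the scores.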

Here we have a case where two different types of departure from the 
Poisson law both lead to results of the same form, and modify it in the 
same direction. If the law is nevertheless found to agree with the facts, 
it is reasonable to reject both types of departure. Thus the agreement 
of the data about deaths from kicks of a horse in the Prussian army
may be taken to mean both (1) that nobody can be killed twice by the
kick of a horse, (2) that the fact that one man has been so killed does 
not indicate an extra liability for others in the same unit to be. The 
agreement in the radioactivity data would mean that (1) the chances 
of disintegration of different atoms of the same radioactive substance 
are approximately equal, (2) the disintegration of one atom does not 
lead immediately to the disintegration of another. 


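2.5. Correlation. This can be treated on lines analogous to the derivation
of the normal law from the binomial. Suppose that two quantities
$x$ and $y$ are to be measured simultaneously, and that there are $m+n$
independent component variations, each contributing $\pm\alpha$ to $x$ and $\pm\beta$
to $y$. $m$ of them are constrained to give the same sign in both $x$ and $y$,
$n$ to give opposite signs. Suppose that in a particular case the number
making positive contributions to $x$ that give the same sign is $p$, the
number giving opposite signs $q$. Then

$$x = p\alpha - (m-p)\alpha + q\alpha - (n-q)\alpha = (2p-m)\alpha + (2q-n)\alpha, \tag{1}$$

$$y = p\beta - (m-p)\beta - q\beta + (n-q)\beta = (2p-m)\beta - (2q-n)\beta. \tag{2}$$

We are taking each component to be as likely as not to give a positive
contribution to $x$. Then

$$P(p, q \mid m, n, \alpha, \beta, H) = \frac{2}{\pi\sqrt{mn}}\exp\left\{-\frac{2}{m}\left(p-\tfrac12 m\right)^2 - \frac{2}{n}\left(q-\tfrac12 n\right)^2\right\} \tag{3}$$

by the previous argument. We have to transform to the observed
variables $x$ and $y$. Now

$$p - \tfrac12 m = \tfrac14\left(\frac{x}{\alpha}+\frac{y}{\beta}\right), \qquad q - \tfrac12 n = \tfrac14\left(\frac{x}{\alpha}-\frac{y}{\beta}\right). \tag{4}$$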

Remembering that p and q are capable of integral values only, and that 





the total chance in any region must be the same whether the observation 
is expressed in terms of $p$ and $q$ or of $x$ and $y$, we see that we must
replace the sum with regard to $p$ and $q$ by the integral with regard to
$dx\,dy/8\alpha\beta$. Hence

$$P(dx\,dy \mid m, n, \alpha, \beta, H) = \frac{dx\,dy}{4\pi\alpha\beta\sqrt{mn}}\exp\left\{-\frac{1}{8m}\left(\frac{x}{\alpha}+\frac{y}{\beta}\right)^2 - \frac{1}{8n}\left(\frac{x}{\alpha}-\frac{y}{\beta}\right)^2\right\}. \tag{5}$$

Now put

$$(m+n)\alpha^2 = \sigma^2; \qquad (m+n)\beta^2 = \tau^2; \qquad (m-n)\alpha\beta = \rho\sigma\tau. \tag{6}$$

Then we find

$$P(dx\,dy \mid m, n, \alpha, \beta, H) = \frac{dx\,dy}{2\pi\sigma\tau\sqrt{1-\rho^2}}\exp\left\{-\frac{1}{2(1-\rho^2)}\left(\frac{x^2}{\sigma^2} - \frac{2\rho xy}{\sigma\tau} + \frac{y^2}{\tau^2}\right)\right\}, \tag{7}$$

so that the four original parameters are now reduced to three, and we
can assert that this is also equal to $P(dx\,dy \mid \sigma, \tau, \rho, H)$. Of course, everything
that can be said against the normal law of error for one variable
can be said twice against this form, which is the generalization to two
variables. But on the other hand the chief thing that can be said in
favour of the normal law, that of all laws that are anywhere near the
truth it is far the easiest to apply, can also be said with greater force
of normal correlation. The new parameter $\rho$ is called the correlation
coefficient.

The law (7) was obtained first by Sir Francis Galton empirically, by 
studying observed frequencies.† As Pearson remarks:‡ ‘That Galton
should have evolved all this from his observations is to my mind one of
the most noteworthy scientific discoveries arising from pure analysis
of observations.’ Galton had not, at this stage, noticed that negative
correlations exist, since he remarks: ‘Two variable organs are said to
be correlated when the variation of one is accompanied on the average
by more or less variation of the other, and in the same direction,’§ and
he speaks of correlation arising when two variations are the resultant 
of several causes, some common to both and some independent. The 
above analysis permits negative correlations. The more restricted one, 
however, is often valid and leads in particular to an account of intra- 
class correlation. 

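By integration we find

$$P(dx \mid \sigma, \tau, \rho, H) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{x^2}{2\sigma^2}\right)dx.$$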


† B.A. Report, Aberdeen, 1885.

‡ Biometrika, 13, 1920, 25-45. This is a most interesting historical study.
§ Proc. Roy. Soc. 45, 1889, 135.



Therefore

$$P(dy \mid \sigma, \tau, \rho, x, H) = \frac{P(dx\,dy \mid \sigma, \tau, \rho, H)}{P(dx \mid \sigma, \tau, \rho, H)} = \frac{1}{\tau\sqrt{2\pi(1-\rho^2)}}\exp\left\{-\frac{(y-\rho\tau x/\sigma)^2}{2\tau^2(1-\rho^2)}\right\}dy.$$

That is, the probability of $x$ is normally distributed with standard error
$\sigma$, and for given $x$ the probability of $y$ is normally distributed about
$\rho\tau x/\sigma$ with standard error $\tau\sqrt{1-\rho^2}$. The line $y = \rho\tau x/\sigma$ is known as
the line of regression of $y$ on $x$. Similarly the probability of $y$ is normally
distributed with standard error $\tau$, and that of $x$ given $y$ is normally
distributed about $x = \rho\sigma y/\tau$, the line of regression of $x$ on $y$. The lines
of regression coincide only if $\rho = \pm 1$.

The expectations of $x^2$, $y^2$, and $xy$, given $\sigma$, $\rho$, $\tau$, $H$, are respectively
$\sigma^2$, $\tau^2$, $\rho\sigma\tau$.
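The component model of this section can be simulated to see the correlation coefficient emerge. The sketch below is illustrative: it assumes $m$ components with the same sign in $x$ and $y$ and $n$ with opposite signs, for which the relations $(m+n)\alpha^2 = \sigma^2$, $(m+n)\beta^2 = \tau^2$, $(m-n)\alpha\beta = \rho\sigma\tau$ give $\rho = (m-n)/(m+n)$. The seed, the number of trials, and the names are assumptions.

```python
import math
import random

def simulate_pair(m, n, alpha, beta, rng):
    """One (x, y): m components share their sign in x and y, n have opposite signs."""
    x = y = 0.0
    for _ in range(m):
        s = rng.choice((-1, 1))
        x += s * alpha
        y += s * beta
    for _ in range(n):
        s = rng.choice((-1, 1))
        x += s * alpha
        y -= s * beta
    return x, y

def sample_correlation(m, n, alpha, beta, trials=10000, seed=0):
    """Estimate rho from simulated pairs, using the known zero expectations of x and y."""
    rng = random.Random(seed)
    sxx = syy = sxy = 0.0
    for _ in range(trials):
        x, y = simulate_pair(m, n, alpha, beta, rng)
        sxx += x * x
        syy += y * y
        sxy += x * y
    return sxy / math.sqrt(sxx * syy)

if __name__ == "__main__":
    m, n = 30, 10
    print("theoretical rho:", (m - n) / (m + n))
    print("simulated rho:  ", sample_correlation(m, n, alpha=1.0, beta=2.0))
```

Taking $n > m$ gives a negative value, which illustrates that the analysis admits negative correlations.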


2.6. The characteristic function. Suppose that on a given law the
chance of the variable $x$ being less than an assigned value is $f(x)$. Then
the expectation of any function $A(x)$ of $x$ is $\int A(x)\,df(x)$ over the range of
$x$; in which we must understand a Stieltjes integral if $f(x)$ has discontinuities.
These, if any, will all be positive jumps. The characteristic
function $\Omega(\kappa)$ is defined as the expectation of $e^{\kappa x}$, where $\kappa$ is purely
imaginary; thus

$$\Omega(\kappa) = \int e^{\kappa x}\,df(x) \tag{1}$$

and $|\Omega(\kappa)| \le 1$. The integral is absolutely convergent because $\int df(x)$
converges.

The integral

$$\frac{1}{2\pi i}\int_{c-i\infty}^{c+i\infty} e^{\kappa x}\left(e^{-\kappa x_1} - e^{-\kappa x_2}\right)\frac{d\kappa}{\kappa} \quad (x_1 < x_2),$$

in which the path is a line parallel to the imaginary axis on the positive
side, is equal to 1 if $x_1 < x < x_2$, and zero if $x < x_1$ or $> x_2$, being the
difference of two Heaviside unit functions. If we replace the path by
the imaginary axis, except for a small semicircle about the origin, the
integral is unaltered. Also the integral about the small semicircle tends
to zero in the limit and the integrand is continuous. Hence we may
replace the path by the imaginary axis, and

$$\frac{1}{2\pi i}\int_{-i\infty}^{i\infty} e^{\kappa x}\left(e^{-\kappa x_1} - e^{-\kappa x_2}\right)\frac{d\kappa}{\kappa} = \begin{cases}1 & (x_1 < x < x_2),\\ 0 & (x < x_1,\ x_2 < x).\end{cases} \tag{2}$$





Now consider the sum 

too 

— 100 

over r, the ranges from to ^r+1 being so chosen that all points of 
discontinuity of/(^) lie within them, and being some value between 
if and On integrating with regard to k, terms for not between 
and x^ vanish, while those between them contribute 

S {/(^,+l)-/(^r)} (4) 

*1 

in the limit when the intervals become indefinitely short. But the limit 
of the sum is by definition the Stieltjes integral 

GO too ice 

f ^/(^) f dK = f d/c, 

ZTTl J J K ZTTl J K 

a:5=-0Q — ioc —too 

by inverting the order of integration, which is easily shown to be vahd. 

When /(a:) is differentiable this leads to a case of Fourier’s integral 
theorem 

ioo 

-too 

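Similarly, if $\epsilon_1, \epsilon_2, \dots, \epsilon_k$ are a set of variables whose chances are
independent and follow laws given by $f_1(\epsilon_1), f_2(\epsilon_2), \dots$ it can be shown
that the chance that $x_1 < \sum \epsilon_r < x_2$ is

$$\frac{1}{2\pi i}\int_{-i\infty}^{i\infty}\left(e^{-\kappa x_1}-e^{-\kappa x_2}\right)\frac{d\kappa}{\kappa}\int_{-\infty}^{\infty}\!\!\cdots\!\int_{-\infty}^{\infty} e^{\kappa\sum\epsilon_r}\,df_1\dots df_k = \frac{1}{2\pi i}\int_{-i\infty}^{i\infty}\Omega_1(\kappa)\Omega_2(\kappa)\cdots\Omega_k(\kappa)\left(e^{-\kappa x_1}-e^{-\kappa x_2}\right)\frac{d\kappa}{\kappa}, \qquad (7)$$

where the $\Omega$'s are the characteristic functions corresponding to the $f$'s.
Hence the characteristic function of the sum of a set of variables
following independent laws of chance is the product of their separate
characteristic functions.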
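
The product rule can be checked numerically for discrete laws. The following Python sketch is illustrative only: the two component laws (a six-sided die and a variable taking the values ±1), the helper names `char_fn` and `convolve`, and the test value of κ are assumptions made for the example and do not come from the text.

```python
import cmath

def char_fn(law, kappa):
    """Expectation of exp(kappa * x) for a discrete law given as {value: chance}."""
    return sum(p * cmath.exp(kappa * x) for x, p in law.items())

def convolve(law_a, law_b):
    """Law of the sum of two independent discrete variables."""
    out = {}
    for xa, pa in law_a.items():
        for xb, pb in law_b.items():
            out[xa + xb] = out.get(xa + xb, 0.0) + pa * pb
    return out

die = {face: 1 / 6 for face in range(1, 7)}   # hypothetical component law
coin = {-1: 0.5, 1: 0.5}                      # hypothetical component law
total = convolve(die, coin)

kappa = 0.7j                                  # purely imaginary, as in the text
lhs = char_fn(total, kappa)                   # characteristic function of the sum
rhs = char_fn(die, kappa) * char_fn(coin, kappa)
print(abs(lhs - rhs))                         # difference is at rounding level
```

The same check can be repeated for any other purely imaginary κ; the two sides should agree to rounding error whenever the components are independent.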

The characteristic function is intimately related to the expectations
of the powers of $x$, where these exist. If we write

$$\mu_m = \int x^m\,df(x), \qquad (8)$$





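we can call $\mu_m$ the $m$th moment of the law about the origin. If moments
up to order $m$ exist, we can differentiate (1) $m$ times under the integral
sign with regard to $\kappa$, and then for $\kappa = 0$

$$\frac{d^m}{d\kappa^m}\Omega(\kappa) = \mu_m. \qquad (9)$$

Thus, by Taylor's theorem,

$$\Omega(\kappa) = 1 + \mu_1\kappa + \frac{\mu_2\kappa^2}{2!} + \dots + \frac{\mu_m\kappa^m}{m!} + o(\kappa^m) \qquad (10)$$

even though the complete Taylor series may not exist. For this reason
$\Omega(\kappa)$ is also called the moment-generating function. If we take the
origin of $x$ at its expectation, $\mu_1$ will be 0. Inspection of (1) shows that
decreasing all values of $x$ by $\mu_1$ will multiply $\Omega(\kappa)$ by $e^{-\kappa\mu_1}$, and
therefore if $\Omega_0(\kappa)$ is the characteristic function of $x - \mu_1$,

$$\Omega_0(\kappa) = e^{-\kappa\mu_1}\,\Omega(\kappa).$$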

The coefficients of $\kappa^m/m!$ in the expansion of $\log\Omega(\kappa)$ are called the
semi-invariants or cumulants, when they exist, since the second and
higher ones are independent of the origin and are additive for the sum
of several variables. Also if $y$ has a probability law $g(y)$ such that
$g(y) = f(x)$ if $y = ax$, where $a$ is constant, the characteristic function
of $g(y)$ is

$$E(\kappa) = \int e^{\kappa y}\,dg(y) = \int e^{a\kappa x}\,df(x) = \Omega(a\kappa). \qquad (11)$$

The moment and the semi-invariant of $g(y)$ of order $m$ are $a^m$ times
those of $f(x)$.

If $\Omega(\kappa)$ can be expanded in powers of $\kappa$, it will follow that the series
represents an analytic function near $\kappa = 0$. But if any moment of the
law diverges, the integral (1) defining $\Omega(\kappa)$ will not exist for $\kappa$ on at
least one side of the imaginary axis, however close to it, since the
integral will contain a factor $e^{cx}$, where $c$ is real and not zero. Thus
the integral will define a function only for purely imaginary values of $\kappa$.
It may be the value on the imaginary axis of some function analytic
in the half-plane, but such a function, if it exists, will not be given off
the axis by the integral. This applies to laws of Pearson's Types IV,
VII, and VI.

The integral may exist for all real $\kappa$; this applies in all cases where
the law has a finite range, such as the binomial and Type I laws. It is
also true for the normal law. In that case the integral will exist for all
$\kappa$ and be uniformly convergent in any bounded region of the $\kappa$ plane.
It can therefore be integrated under the integral sign about any contour





in the $\kappa$ plane, and this integral will be 0 since $\oint_C e^{\kappa x}\,d\kappa = 0$. Hence
by Morera's theorem† $\Omega(\kappa)$ is an analytic function within any contour
in the $\kappa$ plane, and must therefore be an integral function.‡ Then $\Omega(\kappa)$
is expansible in powers of $\kappa$ over the entire plane.

There are cases where the integral exists for some complex values of
$\kappa$ and not for others; for instance, the median law

$$df = \tfrac{1}{2}\exp(-|x|/a)\,dx/a.$$

Within the belt $-1/a < R(\kappa) < 1/a$ the integral will define an analytic
function. Outside this belt it diverges.

Thus we have two main types of case. If all the moments of the law
exist and the expectations of $e^{cx}$ also exist, where $c$ is some real quantity,
$\Omega(\kappa)$ will be analytic near 0 and the coefficient of $\kappa^n$ will be $\mu_n/n!$ for
all $n$. If moments up to order $m$ converge, but those of higher orders
diverge, the integral does not define a function except for purely imaginary
values of $\kappa$. Its derivatives at $\kappa = 0$ for imaginary $\kappa$ will give the
moments correctly up to order $m$; but higher derivatives, if they exist,
will not give the higher moments. We shall see that they do not necessarily
exist.

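2.61. The characteristic function is sometimes useful for actually
calculating the moments. Thus consider the binomial law, according to
which the chance of a sampling number less than $l$ is

$$f(l) = \sum_{r=0}^{l-1} {}^{n}C_r\,x^r(1-x)^{n-r}. \qquad (1)$$

Then

$$\Omega(\kappa) = \sum_{l=0}^{n} {}^{n}C_l\,x^l(1-x)^{n-l}e^{\kappa l} = (1-x+xe^{\kappa})^n. \qquad (2)$$

The coefficient of $\kappa$ is $nx$, which is therefore the expectation of $l$. The
moments about $nx$ can then be derived by considering

$$\Omega_0(\kappa) = e^{-nx\kappa}(1-x+xe^{\kappa})^n = (ye^{-x\kappa}+xe^{y\kappa})^n$$

$$= 1 + \frac{n}{2!}xy\,\kappa^2 + \frac{n}{3!}xy(y-x)\,\kappa^3 + \frac{1}{4!}\{3n^2x^2y^2 + nxy(1-6xy)\}\,\kappa^4 + \dots, \qquad (3)$$

where $y = 1-x$; whence the moments to order 4 about the mean are

$$\mu_2 = nxy, \qquad \mu_3 = nxy(y-x), \qquad \mu_4 = 3n^2x^2y^2 + nxy(1-6xy). \qquad (4)$$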
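
As a check on the binomial moments, the central moments can be summed directly from the law and compared with the closed forms $nxy$, $nxy(y-x)$ and $3n^2x^2y^2 + nxy(1-6xy)$. This is only a sketch; the values of `n` and `x` and the function name are hypothetical choices for illustration.

```python
from math import comb

def binomial_central_moment(n, x, order):
    """Central moment of the given order for the binomial law (n trials, chance x)."""
    mean = n * x
    return sum(
        comb(n, l) * x**l * (1 - x) ** (n - l) * (l - mean) ** order
        for l in range(n + 1)
    )

n, x = 12, 0.3          # hypothetical values chosen for illustration
y = 1 - x

mu2 = binomial_central_moment(n, x, 2)
mu3 = binomial_central_moment(n, x, 3)
mu4 = binomial_central_moment(n, x, 4)

formula2 = n * x * y
formula3 = n * x * y * (y - x)
formula4 = 3 * n**2 * x**2 * y**2 + n * x * y * (1 - 6 * x * y)

# Each difference should be at rounding level.
print(mu2 - formula2, mu3 - formula3, mu4 - formula4)
```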

† E. C. Titchmarsh, Theory of Functions, 1932, p. 82.

‡ I am indebted to Professor Littlewood for calling my attention to this point, in
answer to a query.



Pearson's parameters $\sqrt{\beta_1}$ and $\beta_2$ are given by

$$\sqrt{\beta_1} = \frac{\mu_3}{\mu_2^{3/2}} = \frac{y-x}{(nxy)^{1/2}}, \qquad \beta_2 = \frac{\mu_4}{\mu_2^2} = 3 + \frac{1-6xy}{nxy}. \qquad (5)$$

$\sqrt{\beta_1}$ and $\beta_2$ are the characteristic form parameters (as distinct from those
of location and scale) used by him in fitting his characteristic laws and
other types of law. They are otherwise useful as a general indication
of the features of a law. If $x < \tfrac12$, the positive sign of $\sqrt{\beta_1}$ indicates the
skewness due to the longer range on the upper side of the mean. If
$x = \tfrac12$, the law is symmetrical and $\beta_2 = 3 - 2/n$. In the limit when $n$
is large and the law tends to the normal, therefore, the fourth moment
tends to three times the square of the second. The fact that $\beta_2 < 3$
for the symmetrical binomial is an indication of the effect of the finite
range. The law is lower in the middle and at the tails than the normal
law with the same $\mu_2$.

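In (2) put $x = r/n$ and let $n$ tend to infinity; then the law tends to the
Poisson form. In this case the mean of the law is $r$; shifting the origin
to the mean we have

$$\Omega_0(\kappa) = \exp\{r(e^{\kappa}-1-\kappa)\} = \exp r\left\{\frac{\kappa^2}{2!}+\frac{\kappa^3}{3!}+\dots\right\} \qquad (6)$$

$$= 1 + \frac{r\kappa^2}{2!} + \frac{r\kappa^3}{3!} + (3r^2+r)\frac{\kappa^4}{4!} + \dots, \qquad (7)$$

whence

$$\mu_2 = r, \qquad \mu_3 = r, \qquad \mu_4 = 3r^2+r. \qquad (8)$$

The semi-invariants are all equal to $r$, by (6).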
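
The Poisson moments about the mean can be confirmed by direct summation over the law. The sketch below assumes a hypothetical mean `r = 2.5` and truncates the series after 120 terms, which is far more than enough for a mean of this size.

```python
from math import exp

def poisson_central_moment(r, order, terms=120):
    """Central moment of the Poisson law with mean r, summed term by term."""
    chance = exp(-r)            # chance of m = 0
    total = 0.0
    for m in range(terms):
        total += chance * (m - r) ** order
        chance *= r / (m + 1)   # recurrence giving the chance of m + 1
    return total

r = 2.5                         # hypothetical mean chosen for illustration
mu2 = poisson_central_moment(r, 2)
mu3 = poisson_central_moment(r, 3)
mu4 = poisson_central_moment(r, 4)

# Differences from the closed forms r, r and 3r^2 + r.
print(mu2 - r, mu3 - r, mu4 - (3 * r**2 + r))
```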

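For the negative binomial law

$$\Omega(\kappa) = \sum_{m=0}^{\infty}\left(\frac{n}{n+r}\right)^{n}\frac{n(n+1)\cdots(n+m-1)}{m!}\left(\frac{r}{n+r}\right)^{m}e^{\kappa m} \qquad (9)$$

$$= \left\{1-\frac{r(e^{\kappa}-1)}{n}\right\}^{-n}. \qquad (10)$$

The coefficient of $\kappa$ in the expansion is $r$, which is the expectation of $m$;
and

$$\Omega_0(\kappa) = e^{-r\kappa}\left\{1-\frac{r(e^{\kappa}-1)}{n}\right\}^{-n}, \qquad (11)$$

$$\log\Omega_0(\kappa) = \left(\frac{r}{2}+\frac{r^2}{2n}\right)\kappa^2 + \left(\frac{r}{6}+\frac{r^2}{2n}+\frac{r^3}{3n^2}\right)\kappa^3 + \left(\frac{r}{24}+\frac{7r^2}{24n}+\frac{r^3}{2n^2}+\frac{r^4}{4n^3}\right)\kappa^4 + \dots \qquad (12)$$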



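The second moment is therefore $r + r^2/n$, as we found directly in 2.4;
the third and fourth are

$$\mu_3 = r + \frac{3r^2}{n} + \frac{2r^3}{n^2}, \qquad \mu_4 = r + 3r^2 + \frac{7r^2}{n} + r^3\left(\frac{6}{n}+\frac{12}{n^2}\right) + r^4\left(\frac{3}{n^2}+\frac{6}{n^3}\right). \qquad (13)$$

For the normal law with standard error $\sigma$ we find easily

$$\Omega(\kappa) = \exp(\tfrac12\sigma^2\kappa^2). \qquad (14)$$

All the moments converge, and

$$\mu_{2m} = \frac{(2m)!}{2^m\,m!}\,\sigma^{2m}, \qquad \mu_{2m+1} = 0. \qquad (15)$$

For the median law

$$df = \frac12\exp\left(-\frac{|x|}{a}\right)\frac{dx}{a}$$

we find

$$\Omega(\kappa) = \frac{1}{1-a^2\kappa^2}. \qquad (16)$$

The second moment is $2a^2$, as we can see at once otherwise.

For the binomial, Poisson, and normal laws all the moments exist
and the characteristic function is an integral function. For the negative
binomial and the median law all the moments exist, but the characteristic
function has poles and is not defined over the whole $\kappa$ plane
by 2.6 (1).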

2.62. Consider now a case where the second moment is infinite,
the Cauchy distribution (the Type VII law with index 1)

$$\frac{df}{dx} = \frac{1}{\pi(1+x^2)}. \qquad (1)$$

The integral for $\Omega(\kappa)$ must be found by contour integration. When
$I(\kappa)$ is positive the infinite semicircle must be taken on the positive
side of the axis of $x$, and the contour encloses the pole at $x = i$. When
$I(\kappa)$ is negative, on the other hand, the suitable contour encloses the
pole at $-i$. Thus $\Omega(\kappa)$ has different analytic forms according to the
sign of $I(\kappa)$. They are

$$\Omega(\kappa) = e^{i\kappa} \quad \{I(\kappa) > 0\}, \qquad (2)$$

$$\Omega(\kappa) = e^{-i\kappa} \quad \{I(\kappa) < 0\}. \qquad (3)$$

The first derivative of $\Omega(\kappa)$ does not exist at $\kappa = 0$, and no function
analytic in any region about $\kappa = 0$ can represent $\Omega(\kappa)$.

For the Type VII law with index 2,

$$\frac{df}{dx} = \frac{2}{\pi(1+x^2)^2}, \qquad (4)$$

we find similarly

$$\Omega(\kappa) = (1-i\kappa)\,e^{i\kappa} \quad \{I(\kappa) > 0\}, \qquad (5)$$

$$\Omega(\kappa) = (1+i\kappa)\,e^{-i\kappa} \quad \{I(\kappa) < 0\}. \qquad (6)$$

Derivatives to order 2 are continuous at $\kappa = 0$, corresponding to the
existence of the second moment. But the third derivative at $\kappa = 0$
has different values on the two sides, and $\Omega(\kappa)$ is not the form taken by
any function analytic in a region about $\kappa = 0$.

2.63. The central limit theorem. The interest of 2.6 (7) lies in its
relation to the resultant of a number of independent disturbances. In
many cases, if the number is large, it can be shown that the chance of
the resultant is approximately normally distributed. We may notice,
first, that if there are two components both following the normal law
with standard errors $\sigma$ and $\tau$, the respective values of $\Omega(\kappa)$ will be
$e^{\frac12\sigma^2\kappa^2}$ and $e^{\frac12\tau^2\kappa^2}$, and by 2.6 (7) the characteristic function of their sum
is $\exp\{\tfrac12(\sigma^2+\tau^2)\kappa^2\}$. Hence the distribution of the chance for the sum is
normal with standard error $\sqrt{\sigma^2+\tau^2}$. This can be extended to the
composition of any number of normal errors. This principle is called
the reproductive property of the normal law.

If for each component $\epsilon_r$ we take the origin at the expectation of $\epsilon_r$,
and all the second moments about this origin are 1, we have by 2.6 (10)




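$$\Omega_r(\kappa) = 1 + \tfrac12\kappa^2 + o(\kappa^2). \qquad (1)$$

If instead we consider $\epsilon_r/\sqrt{k}$, the second moment is divided by $k$, and
by 2.6 (11)

$$\Omega_r(\kappa) = 1 + \frac{\kappa^2}{2k} + o\!\left(\frac{\kappa^2}{2k}\right). \qquad (2)$$

It is to be noticed that $\Omega_r$ is a function of $\kappa/\sqrt{k}$ and therefore the
remainder term, for any $\kappa$, is small compared with $1/k$ for $k$ large. Then
the characteristic function of $\sum_{r=1}^{k}\epsilon_r/\sqrt{k}$ will be

$$\Omega(\kappa) = \Omega_1(\kappa)\Omega_2(\kappa)\cdots\Omega_k(\kappa) = \left\{1+\frac{\kappa^2}{2k}+o\!\left(\frac{\kappa^2}{2k}\right)\right\}^{k} \qquad (3)$$

if all components follow the same law. But even if they do not we shall
have

$$\log\Omega(\kappa) = \sum_{r=1}^{k}\log\Omega_r(\kappa) = \sum_{r=1}^{k}\left\{\frac{\kappa^2}{2k}+o\!\left(\frac{\kappa^2}{2k}\right)\right\} \qquad (4)$$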


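and the differences between the laws appear only in the terms $o(\kappa^2/2k)$.
If then $k \to \infty$, $\log\Omega(\kappa) \to \tfrac12\kappa^2$, and in the limit

$$\Omega(\kappa) = \exp(\tfrac12\kappa^2). \qquad (5)$$

The chance that $\sum\epsilon_r/\sqrt{k}$ lies between $x_1$ and $x_2$ is therefore

$$\frac{1}{2\pi i}\int_{-i\infty}^{i\infty}\exp(\tfrac12\kappa^2)\left(e^{-\kappa x_1}-e^{-\kappa x_2}\right)\frac{d\kappa}{\kappa}. \qquad (6)$$

This is differentiable, and the derivative gives the probability density

$$\frac{1}{2\pi i}\int_{-i\infty}^{i\infty}\exp(\tfrac12\kappa^2-\kappa x)\,d\kappa = \frac{1}{\sqrt{2\pi}}\exp(-\tfrac12x^2). \qquad (7)$$

Thus the probability distribution of the sum of $k$ component variations,
all following independent laws of chance with second moments $1/k$,
will tend in the limit as $k$ becomes large to the normal law with standard
error 1.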
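
The limit can be watched numerically in a simple case. The sketch below assumes, purely for illustration, that each component takes the values $+1/\sqrt{k}$ and $-1/\sqrt{k}$ with chance ½ each, so that on the imaginary axis ($\kappa = it$) each component contributes a factor $\cos(t/\sqrt{k})$, and the limit $\exp(\tfrac12\kappa^2)$ becomes $\exp(-\tfrac12t^2)$. The test point `t` and the function name are hypothetical.

```python
import math

def omega_scaled_sum(t, k):
    """Characteristic function at kappa = i*t of the sum of k components,
    each equal to +1/sqrt(k) or -1/sqrt(k) with chance 1/2."""
    return math.cos(t / math.sqrt(k)) ** k

t = 1.5                                   # hypothetical test point
limit = math.exp(-0.5 * t * t)            # exp(kappa^2 / 2) evaluated at kappa = i*t
errors = [abs(omega_scaled_sum(t, k) - limit) for k in (10, 100, 1000, 10000)]
print(errors)                             # the discrepancy shrinks as k grows
```

The discrepancy falls roughly in proportion to $1/k$, which is the order of the first neglected term in the logarithm.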

It is not quite obvious that if a sequence of characteristic functions
tends to a limit, that limit is the characteristic function of the limit,
if any, of the corresponding laws. It is proved by H. Cramér that the
convergence is uniform and therefore that the passage from (5) to (6)
is justified.

If the components have not all the same second moment, the result
is not necessarily true; Whittaker and Robinson† give a striking
example to the contrary.

The above argument is given, with more attention to mathematical
detail, by H. Cramér.‡ The important point is that it does not assume
the existence of moments above the second. The derivation of the
normal law on similar principles, given by Whittaker and Robinson
and reproduced with minor changes in my Scientific Inference, is no
longer of much interest. For it was assumed in the course of the proof
that the functions $\Omega_r(\kappa)$ are all expansible in powers of $\kappa$, with coefficients
given by the moments, which can be true only if all the moments are
finite. The resultant of several components, each satisfying the normal
law, itself satisfies the law exactly. The extreme departure from the
normal law for each component that would make all the moments
finite is one where the chance is concentrated in two values, since any
further spread would amount to a smoothing of the distribution and
make it more like the normal. But we already know that the resultant

† Calculus of Observations, p. 178.

‡ Random Variables and Probability Distributions, 1937.





of several components in this case would give a binomial law and would
be approximately normal if there were several components. The proof,
therefore, added little to what was already obvious.

The argument has been extended by various writers to the case where
the components $\epsilon_r$ follow independent laws $f_r(\epsilon_r)$ with second moments
$\mu_{2,r}$ about 0, provided that as $n \to \infty$, $\sum_{r=1}^{n}\mu_{2,r} \to \infty$ and, for all $r$,
$\mu_{2,r}/\sum_{r=1}^{n}\mu_{2,r} \to 0$. In other words, the second moment for the sum tends to
infinity but the largest proportional contribution from a component
tends to zero. Then if $\sigma_n^2 = \sum_{r=1}^{n}\mu_{2,r}$, the chance that $\sum\epsilon_r/\sigma_n$ is less
than $x$ tends to

$$\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x}\exp(-\tfrac12t^2)\,dt.$$

Details are given by Kendall.†

2.64. If one or more of the components have an infinite $m$th moment
($m > 2$) and the number of components is finite, the normal law can
be approximate only in a rather peculiar sense, for it makes all the
moments finite, whereas in such a case the $m$th moment for the resultant
is infinite. An investigation of a special case is desirable to see what
this sense can be. But it is convenient to take first the Cauchy law of
2.62. For the resultant of $k$ components

$$\Omega(\kappa) = e^{ik\kappa} \quad \{I(\kappa) > 0\}, \qquad (1)$$

$$\Omega(\kappa) = e^{-ik\kappa} \quad \{I(\kappa) < 0\}, \qquad (2)$$

and the probability density is

$$\frac{1}{2\pi i}\int_{-i\infty}^{i\infty} e^{-\kappa x}\,\Omega(\kappa)\,d\kappa = \frac{k}{\pi(k^2+x^2)}, \qquad (3)$$

which is of the form for one component, but with the scale multiplied
by $k$. The mean of $k$ components from this law follows exactly the
same law as for one component, a fact emphasized by Fisher. What
would happen with a large number of observations in this case would
be that larger and larger deviations would occur, the extremes increasing
so rapidly that the mean will fluctuate by quantities of order 1.

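For a component satisfying the law

$$P(dx_r \mid H) = \frac{b_r\,dx_r}{\pi\{(x_r-a_r)^2+b_r^2\}} \qquad (4)$$

we have

$$\Omega_r(\kappa) = \begin{cases} e^{(a_r+ib_r)\kappa} & \{I(\kappa) > 0\}, \\ e^{(a_r-ib_r)\kappa} & \{I(\kappa) < 0\}. \end{cases} \qquad (5)$$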

† The Advanced Theory of Statistics, vol. 1, 1943, 99-103, 180-2.




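For the sum of $k$ such components

$$\Omega(\kappa) = \begin{cases} e^{(\sum a_r + i\sum b_r)\kappa} & \{I(\kappa) > 0\}, \\ e^{(\sum a_r - i\sum b_r)\kappa} & \{I(\kappa) < 0\}. \end{cases} \qquad (6)$$

Hence the sum follows the law

$$P(dx \mid H) = \frac{\sum b_r\,dx}{\pi\{(x-\sum a_r)^2+(\sum b_r)^2\}}. \qquad (7)$$

Thus the $a_r$ and $b_r$ are both additive. This can also be proved by direct
integration for the combination of two components and generalized by
mathematical induction.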

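For the Type VII law with index 2, if we reduce the scale in the ratio
$k^{1/2}$ and combine $k$ components, we have for the resultant

$$\Omega(\kappa) = \begin{cases} \left(1-\dfrac{i\kappa}{\sqrt{k}}\right)^{k} e^{i\kappa\sqrt{k}} & \{I(\kappa) > 0\}, \\[2ex] \left(1+\dfrac{i\kappa}{\sqrt{k}}\right)^{k} e^{-i\kappa\sqrt{k}} & \{I(\kappa) < 0\}, \end{cases} \qquad (8)$$

and the probability density is

$$\frac{P(dx \mid H)}{dx} = \frac{1}{2\pi i}\int_{0}^{i\infty} e^{-\kappa x}\left(1-\frac{i\kappa}{\sqrt{k}}\right)^{k} e^{i\kappa\sqrt{k}}\,d\kappa + \frac{1}{2\pi i}\int_{-i\infty}^{0} e^{-\kappa x}\left(1+\frac{i\kappa}{\sqrt{k}}\right)^{k} e^{-i\kappa\sqrt{k}}\,d\kappa. \qquad (9)$$

Apart from the factor in $x$ the integrands are real and positive and
become exponentially small within a distance from the origin of order
$k^{1/2}$. We can approximate to the logarithm of the integrand in powers
of $\kappa$ and find

$$\frac{P(dx \mid H)}{dx} = \frac{1}{2\pi i}\int_{-i\infty}^{i\infty}\exp\left\{\tfrac12\kappa^2-\kappa x+O\!\left(\frac{\kappa^3}{k^{1/2}}\right)\right\}d\kappa, \qquad (10)$$

and the term in $\kappa^3$ is negligible. Then

$$\frac{P(dx \mid H)}{dx} = \frac{1}{\sqrt{2\pi}}\exp(-\tfrac12x^2). \qquad (11)$$

This will be valid provided $x$ is not comparable with $k^{1/2}$. If it is of order
$k^{1/2}$ or larger, $\kappa x$ and the neglected terms in $\kappa^3$ will be comparable. We
have therefore for large $k$ an approximation of the same nature as that
found for the binomial; the normal law is a good approximation over
a range that includes most of the chance.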

If $x$ is comparable with $k^{1/2}$ or larger a different form of approximation
is necessary. The method of steepest descents is also unsuitable
because there is a branch-point at the origin and the paths of steepest
descent from it do not go near the saddle-points. But for the two parts
the integrands fall off most rapidly in directions in the first and fourth
quadrants respectively, and we can replace the integrals by those along
the real axis and then apply Watson's lemma.† Then

$$\frac{P(dx \mid H)}{dx} = \frac{1}{\pi}\,I\int_{0}^{\infty} e^{-\kappa x}\left(1-\frac{i\kappa}{\sqrt{k}}\right)^{k} e^{i\kappa\sqrt{k}}\,d\kappa \qquad (12)$$

and we want the imaginary part of the integral for $\kappa$ small. The first
non-zero term is

$$\frac{1}{\pi}\int_{0}^{\infty} e^{-\kappa x}\,\frac{\kappa^3}{3\sqrt{k}}\,d\kappa = \frac{2}{\pi\sqrt{k}\,x^4}. \qquad (13)$$

This is proportional to 2.62 (4) for $x$ large, but it is divided by $\sqrt{k}$;
higher terms will involve higher powers of $1/x$. The effect of combining
several components is therefore to give an approach to the normal up
to an indefinitely increasing multiple of the standard error; beyond this
multiple the law retains the original form except that all ordinates are
reduced in approximately the same ratio. The higher moments do in
fact remain infinite, but the area of the tails is greatly reduced.

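2.65. We can make a little further progress by considering cases
where the fourth moment is finite. Taking the origin at the mean and
the second moment as unity, we shall have

$$\Omega(\kappa) = 1 + \tfrac12\kappa^2 + \tfrac16\mu_3\kappa^3 + \tfrac1{24}\mu_4\kappa^4 + o(\kappa^4), \qquad (1)$$

and if we contract the scale in the ratio $k^{1/2}$ and combine $k$ components,

$$\frac{P(dx \mid H)}{dx} = \frac{1}{2\pi i}\int_{-i\infty}^{i\infty}\exp\left\{\tfrac12\kappa^2-\kappa x+\frac{\mu_3\kappa^3}{6k^{1/2}}+\frac{(\mu_4-3)\kappa^4}{24k}+o\!\left(\frac{\kappa^4}{k}\right)\right\}d\kappa. \qquad (2)$$

If some higher moment is infinite, the corresponding derivative of
$\Omega(\kappa)$ will not exist at $\kappa = 0$, and we cannot immediately apply the
method of steepest descents to (2) because the integrand is not analytic.
But (2) is a valid approximation when $\kappa = o(k^{1/2})$, and for large $\kappa$ the
integrand is small. Hence if we drop the last term the error will be
negligible, and we can apply steepest descents because without this
term the integrand is analytic. Then, for large $k$,

$$\frac{P(dx \mid H)}{dx} = \frac{1}{2\pi i}\int\exp\left\{\tfrac12\kappa^2-\kappa x+\frac{\mu_3\kappa^3}{6k^{1/2}}+\frac{(\mu_4-3)\kappa^4}{24k}\right\}d\kappa \qquad (3)$$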


† H. and B. S. Jeffreys, Methods of Mathematical Physics, pp. 471, 668.





and if we take the path through $\kappa = x$ we shall have, nearly,

$$\frac{P(dx \mid H)}{dx} = \frac{1}{\sqrt{2\pi}}\exp(-\tfrac12x^2)\exp\left\{\frac{\mu_3x^3}{6k^{1/2}}+\frac{(\mu_4-3)x^4}{24k}\right\}. \qquad (4)$$

The correcting factor will become important if $x$ is of order $(6k^{1/2}/\mu_3)^{1/3}$ or
$\{24k/(\mu_4-3)\}^{1/4}$, whichever is the smaller. Thus symmetry and approximate
normality for the separate components will favour rapid approach
to normality for the resultant. There is evidence that some errors of
observation follow a Type VII law with index about 4.† For this, if
$\mu_2 = 1$, $\mu_3 = 0$, $\mu_4 = 5$, the correcting factor is $\exp(x^4/12k)$, for $x$
not too large.

The conditions for the normal law to hold are fairly well satisfied
in some cases, especially where the observed value is the mean of several
crude readings. Thus in the standard method of determining the magnetic
dip both ends of the needle are read, the needle turned over, the
case rotated, and the magnetization reversed to eliminate various
systematic errors. The error of the mean is then the resultant of sixteen
components, presumably with the same finite second moment, and
the normal law should be right up to about $(12 \times 16)^{1/4} = 3.7$ times the
standard error. In Bullard's observations of gravity in East Africa,‡
two separate swings of the pendulums in the field were compared with
two in Cambridge taken at the same time; the error is therefore the
resultant of four components, and if the separate laws have index 4
the normal law should hold up to about 2.6 times the standard error.
But where there is a dominating source of error there may well be
considerable departures from the normal law.
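
The multiples of the standard error quoted for the magnetic dip and the gravity observations follow from setting the exponent $(\mu_4-3)x^4/24k$ of the correcting factor equal to 1. The helper below is a hypothetical illustration; the default `mu4=5.0` is the fourth moment of the Type VII law with index 4 and unit second moment.

```python
def normal_range(k, mu4=5.0):
    """Multiple of the standard error at which the exponent
    (mu4 - 3) * x**4 / (24 * k) of the correcting factor reaches 1."""
    return (24.0 * k / (mu4 - 3.0)) ** 0.25

dip = normal_range(16)      # sixteen components (magnetic dip)
gravity = normal_range(4)   # four components (pendulum swings)
print(round(dip, 2), round(gravity, 2))
```

With $\mu_4 = 5$ the expression reduces to $(12k)^{1/4}$, so the range of validity grows only as the fourth root of the number of components.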

The normal law of error cannot therefore be theoretically proved.
Its justification is that in representing many types of observations it is
apparently not far wrong, and is much more convenient to handle than
others that might or do represent them better. Various theoretical
attempts at justification have been made, notably Gauss's proof that if
the mean is the most probable value, the normal law must hold. But the
argument would equally imply that since we know many cases where
the law does not hold the mean is not the best estimate. Indeed, we have
had Cauchy's case where the mean is no better than one observation;
but with a different way of making the estimate we could get much
higher accuracy from many observations than from one even with this
law. Whittaker and Robinson (p. 216) give a theoretical argument for
the principle of the arithmetic mean, but this is fallacious. It depends

† See later, p. 290. ‡ Phil. Trans. A, 235, 1936, 446-531.




on confusion between the measurement of two different quantities in
terms of the same unit and of the same quantity with respect to two
different units, and between the difference of two quantities with regard
to the same origin and the same quantity with regard to different
origins. The irrelevance of the unit and origin may be legitimate axioms,
but are replaced by the former pair in the course of the argument.†

2.66. When several components following the same symmetrical law
with a finite range are combined, the approach to the normal is very
rapid. Thus an elementary law may consist of chances $\tfrac12$ at each of
$\pm 1$. If we combine three such components the second moment for the
resultant is 3, the possible values being $-3$, $-1$, $+1$, $+3$. Compare
the expectations for eight observations with those corresponding to the
normal law with the same second moment, supposed rounded to the
nearest odd integer:

             < -4     -3      -1      +1      +3     > +4
Binomial      0        1       3       3       1       0
Normal      0.084    0.908   3.008   3.008   0.908   0.084

For four components and sixteen observations the expectations in
ranges about the even numbers are as follows:

             < -5     -4      -2       0      +2      +4     > +5
Binomial      0        1       4       6       4       1       0
Normal      0.10     0.97    3.86    6.13    3.86    0.97    0.10

In neither case do the probabilities of one observation falling in a
particular range differ by more than 0.012. It can be shown that if
the observations were in fact derived from a binomial law with three
components, and we were given only the totals by ranges to compare
by the $\chi^2$ test, used in Pearson's way, with the postulate that they are
derived from the normal law, it would take about 500 observations to
reveal a discrepancy.‡
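
The two comparisons can be recomputed as follows. This is a sketch under the assumptions stated in the text (components equal to ±1 with chance ½ each, a normal law of the same second moment grouped into ranges of width 2 about the possible values); the function names are hypothetical, and the normal integral is obtained from `math.erf`. The extreme open-ended ranges of the tables are not listed separately here.

```python
from math import comb, erf, sqrt

def normal_cdf(z):
    """Chance that a standard normal variable is less than z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def compare_with_normal(components):
    """Expectations among 2**components observations for the sum of
    `components` variables, each +1 or -1 with chance 1/2, against a normal
    law with the same second moment grouped into ranges of width 2."""
    n_obs = 2 ** components
    sigma = sqrt(components)              # second moment of the sum is `components`
    rows = []
    for ups in range(components + 1):
        value = 2 * ups - components      # possible value of the sum
        binomial = comb(components, ups)
        normal = n_obs * (normal_cdf((value + 1) / sigma)
                          - normal_cdf((value - 1) / sigma))
        rows.append((value, binomial, normal))
    return rows

for components in (3, 4):
    rows = compare_with_normal(components)
    worst = max(abs(b - g) for _, b, g in rows) / 2 ** components
    print(components, [(v, b, round(g, 3)) for v, b, g in rows], round(worst, 4))
```

The last number printed for each case is the largest difference between the two laws in the probability of a single observation falling in one of the listed ranges.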

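If the primitive law is a rectangular one from $-1$ to $+1$, we have

$$P(dx \mid H) = \tfrac12\,dx \qquad (-1 < x < 1) \qquad (1)$$

and

$$\Omega(\kappa) = \frac12\int_{-1}^{1} e^{\kappa x}\,dx = \frac{e^{\kappa}-e^{-\kappa}}{2\kappa}. \qquad (2)$$

For two components the law will be

$$\frac{P(dx \mid H)}{dx} = \frac{1}{2\pi i}\int_{L}\frac{e^{2\kappa}-2+e^{-2\kappa}}{4\kappa^2}\,e^{-\kappa x}\,d\kappa = \begin{cases} \tfrac14(2-x) & (0 < x < 2), \\ \tfrac14(2+x) & (-2 < x < 0). \end{cases} \qquad (3)$$

This is known as the triangular distribution.

† Calculus of Observations, pp. 216-17. ‡ Phil. Trans. A, 237, 1938, 236.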





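For three components it is

$$\frac{P(dx \mid H)}{dx} = \begin{cases} \tfrac1{16}(3+x)^2 & (-3 < x < -1), \\ \tfrac1{16}(6-2x^2) & (-1 < x < 1), \\ \tfrac1{16}(3-x)^2 & (1 < x < 3). \end{cases} \qquad (4)$$

The second moments for (1), (3), and (4) are $\tfrac13$, $\tfrac23$, and 1. Rescaling to
give unit second moment in each case we have from (1) and (3)

$$P(dx \mid H) = \frac{dx}{2\sqrt3} \qquad (-\sqrt3 < x < \sqrt3), \qquad (5)$$

$$\frac{P(dx \mid H)}{dx} = \begin{cases} \tfrac16(\sqrt6-x) & (0 < x < \sqrt6), \\ \tfrac16(\sqrt6+x) & (-\sqrt6 < x < 0), \end{cases} \qquad (6)$$

while (4) needs no change.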

We see at a glance from Fig. 1 that (6) already gives a fair approach 
to the normal, though it has combined only two rectangular distribu- 
tions; while (4) is very close, even at the tails. 
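
The normalization and second moments of the rectangular, triangular and three-component laws can be confirmed by a simple midpoint-rule integration. The sketch below is illustrative; the function names and the number of steps are hypothetical choices, and the three densities are those for one, two and three rectangular components on $(-1, 1)$.

```python
def density_one(x):
    """Rectangular law on (-1, 1)."""
    return 0.5 if -1 < x < 1 else 0.0

def density_two(x):
    """Triangular law for the sum of two rectangular components."""
    return (2 - abs(x)) / 4 if abs(x) < 2 else 0.0

def density_three(x):
    """Law for the sum of three rectangular components."""
    ax = abs(x)
    if ax < 1:
        return (6 - 2 * x * x) / 16
    if ax < 3:
        return (3 - ax) ** 2 / 16
    return 0.0

def moment(density, half_range, order, steps=6000):
    """Midpoint-rule estimate of the moment of the given order about 0."""
    h = 2 * half_range / steps
    total = 0.0
    for i in range(steps):
        x = -half_range + (i + 0.5) * h
        total += x**order * density(x)
    return total * h

second_moments = [
    moment(density_one, 1, 2),
    moment(density_two, 2, 2),
    moment(density_three, 3, 2),
]
print(second_moments)   # close to 1/3, 2/3 and 1
```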

The approach to the normal is much less rapid if the component laws
are asymmetrical. Thus if three components each give chances $\tfrac23$ of
$-\tfrac13$ and $\tfrac13$ of $+\tfrac23$, the expectations from the results of 27 observations
are

      -1      0     +1     +2
       8     12      6      1

and plainly no normal law can fit all the chances within a little under
0.04.


2.7. The $\chi^2$ distribution. Suppose that we have $n$ independent
variables with normal distributions of chance about zero, so that we
can write

$$P(dx_1\,dx_2\dots dx_n \mid H) = \frac{1}{(2\pi)^{n/2}\sigma_1\sigma_2\cdots\sigma_n}\exp\left\{-\frac12\left(\frac{x_1^2}{\sigma_1^2}+\frac{x_2^2}{\sigma_2^2}+\dots+\frac{x_n^2}{\sigma_n^2}\right)\right\}dx_1\,dx_2\dots dx_n. \qquad (1)$$

Consider the total chance that the function

$$\chi^2 = \frac{x_1^2}{\sigma_1^2}+\frac{x_2^2}{\sigma_2^2}+\dots+\frac{x_n^2}{\sigma_n^2} \qquad (2)$$

may fall in a given range. This can be got by integrating over all values
of $x_1$ to $x_n$ that correspond to $\chi^2$ in this range. First put

$$x_1 = \sigma_1y_1, \quad x_2 = \sigma_2y_2, \quad \text{etc.}$$

Then

$$\chi^2 = y_1^2+y_2^2+\dots+y_n^2 \qquad (3)$$

and

$$P(d\chi^2 \mid H) = (2\pi)^{-n/2}\int\!\!\int\!\cdots\!\int\exp(-\tfrac12\chi^2)\,dy_1\dots dy_n. \qquad (4)$$

If we like we can regard the $y$'s as Cartesian coordinates in $n$ dimensions
and the integral with regard to them as a volume integral. But
in any case in a range between two neighbouring values of $\chi$ we can
neglect the variation of $\chi$, while all the $y$'s are proportional to $\chi$. The
integral from 0 up to a given $\chi$, omitting the factor $\exp(-\tfrac12\chi^2)$, would
be proportional to $\chi^n$; hence the change in it due to a change in $\chi$ is
proportional to $\chi^{n-1}\,d\chi$; now, since we can neglect the variation of
$\exp(-\tfrac12\chi^2)$ within the shell, we have

$$P(d\chi^2 \mid H) \propto \chi^{n-1}\exp(-\tfrac12\chi^2)\,d\chi. \qquad (5)$$

The constant factor can be found by using the condition that $\chi^2$ is
certain to lie between 0 and $\infty$, or the Dirichlet integral may be used.
Then

$$P(d\chi^2 \mid H) = \frac{1}{2^{\frac12n-1}\,\Gamma(\tfrac12n)}\,\chi^{n-1}\exp(-\tfrac12\chi^2)\,d\chi. \qquad (6)$$

It is easy to verify that the expectation of $\chi^2$ is $n$, as is obvious from
its definition. The maximum of the integrand is near $\chi^2 = n$. If we
neglect a factor $\chi^{-1}$ and take logarithms,

$$\frac{d^2}{d\chi^2}(n\log\chi-\tfrac12\chi^2) = -2 \qquad (7)$$

near the maximum, whence if $n$ is large $P(d\chi^2 \mid H)$ is nearly proportional
to $\exp\{-(\chi-\sqrt{n})^2\}\,d\chi$ or to $\exp\{-(\chi^2-n)^2/4n\}\,d\chi^2$. Thus, roughly, we
can write

$$\chi^2 = n \pm \sqrt{2n} \qquad (8)$$

as a summary expression of the rule. Tables giving $P(\chi^2)$, the chance
that $\chi^2$ will exceed a given value, are given by Pearson, Fisher, and
Yule and Kendall.

The interest of this rule is that it often enables us to see very easily
whether a set of data are consistent with a hypothesis. It is required
that we shall have a set of estimates, obtained independently, on a
hypothesis that gives estimates of the standard errors, and that we
compare them with a set of values predicted by the hypothesis. In
general the observed and theoretical values will differ by quantities of
the order of the standard errors, but if we form $\chi^2$ we have a quantity
that would be increased either by an unexpected systematic variation
(the random variation remaining the same), by the actual random
variation being larger than that expected, or by some internal correlation
that makes errors tend to repeat themselves, when the means will
vary more than expected. If $\chi^2$ is less than $n+\sqrt{2n}$ we can usually
say at once that the observations agree with the theory as well as could
be expected, and if it is less than $n+2\sqrt{2n}$ there is no immediate need
to discard the hypothesis. The matter will be treated in more detail
later, but these simple considerations so often cover all that is wanted
that they may as well be stated at the outset.
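
The summary rule $\chi^2 = n \pm \sqrt{2n}$ can be checked against the exact law $\chi^{n-1}\exp(-\tfrac12\chi^2)\,d\chi / \{2^{n/2-1}\Gamma(\tfrac12 n)\}$ by numerical integration. The sketch below is illustrative only; the value `n = 10`, the integration limit and the function names are hypothetical choices.

```python
from math import exp, gamma, sqrt

def chi_density(chi, n):
    """Density of chi for n degrees of freedom."""
    return chi ** (n - 1) * exp(-0.5 * chi * chi) / (2 ** (n / 2 - 1) * gamma(n / 2))

def chi2_moment(n, power, steps=4000):
    """Expectation of (chi^2)**power by the midpoint rule over 0 < chi < sqrt(n) + 12."""
    upper = sqrt(n) + 12.0          # the neglected tail is utterly negligible
    h = upper / steps
    total = 0.0
    for i in range(steps):
        chi = (i + 0.5) * h
        total += chi ** (2 * power) * chi_density(chi, n)
    return total * h

n = 10                               # hypothetical number of degrees of freedom
mean = chi2_moment(n, 1)
variance = chi2_moment(n, 2) - mean**2
print(mean, variance, sqrt(variance), sqrt(2 * n))
```

The computed mean should be close to $n$ and the standard deviation close to $\sqrt{2n}$, which is the content of the rule.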

2.71. It often (or rather usually) happens that the hypothesis
investigated contains some adjustable parameters, and that these are
determined in such a way as to make $\chi^2$ a minimum. If they are fewer
than the $x$'s there will still be an outstanding variation, but we should
naturally expect it to be smaller than the original one. Instead of all
the $x$'s being independent, we must now suppose that the information
with respect to them can be written

$$x_r = l_r\alpha + \eta_r, \qquad (9)$$

where the $l_r$ are known, but $\alpha$ is to be found, and $\eta_r$ can be considered
random. Then on this hypothesis

$$P(dx_1\dots dx_n \mid \alpha H) = \frac{(2\pi)^{-n/2}}{\sigma_1\sigma_2\cdots\sigma_n}\exp\left\{-\sum\frac{(x_r-l_r\alpha)^2}{2\sigma_r^2}\right\}dx_1\dots dx_n. \qquad (10)$$

Now suppose that we determine the value of $\alpha$, $a$ say, that makes
$\sum(x_r-l_r\alpha)^2/\sigma_r^2$ a minimum. Then

$$a\sum\frac{l_r^2}{\sigma_r^2} = \sum\frac{l_rx_r}{\sigma_r^2} \qquad (11)$$

and

$$\sum\frac{(x_r-l_r\alpha)^2}{2\sigma_r^2} = \sum\frac{(x_r-l_ra)^2}{2\sigma_r^2} + (\alpha-a)^2\sum\frac{l_r^2}{2\sigma_r^2}. \qquad (12)$$

The first term on the right is the value of $\tfrac12\chi^2$ that would be found by
comparing the $x_r$ with $l_ra$ instead of with 0 or $l_r\alpha$. Hence

$$P(dx_1\dots dx_n \mid \alpha H) = \frac{(2\pi)^{-n/2}}{\sigma_1\sigma_2\cdots\sigma_n}\exp\left\{-\tfrac12\chi^2-(\alpha-a)^2\sum\frac{l_r^2}{2\sigma_r^2}\right\}dx_1\dots dx_n. \qquad (13)$$

The form of this shows that the information about the $x_r$ can be
regarded as composed of three independent parts. For they would all
be determined by $a$, $\chi$, and $n-2$ direction parameters of the form
$m_r = (x_r-l_ra)/\sigma_r\chi$. If we change to these as new variables the three
groups of chances will be independent, and by applying Theorem 12
we have

$$P(d\chi \mid \alpha, a, m_r, H) \propto \chi^{n-2}\exp(-\tfrac12\chi^2)\,d\chi. \qquad (14)$$

Thus the determination and elimination of each adjustable constant
reduces the index of $\chi$ in the chance for the outstanding variation by 1.
The difference between the number of separate data and the number
of parameters allowed for is usually called the number of degrees of
freedom. If this is identified with the $n$ of (6) the formula will always
hold.

It may be noticed that on the left $d\chi$ means the proposition that $\chi$
will lie in a particular range $d\chi$; $d\chi^2$ means that $\chi^2$ will lie in the corresponding
range $d\chi^2$. These propositions are equivalent and can therefore
be interchanged in the expression on the left by Theorem 3.

2.72. If there is a linear constraint on the data, so that

$$\sum a_rx_r = 0,$$

this also will remove one variable from the integration and reduce the
degrees of freedom by 1 and also the index of the distribution.

2.73. $\chi^2$ was first obtained by Pearson in relation to a problem of
sampling.† In the latter case it can be simply derived from the last
remark. Suppose that we are sampling an enormous population of
several different types, and that the expectations in a sample, given
the time of sampling and the proportions in the population, are
$m_1, \dots, m_p$. Then if these are moderate numbers and the occurrences
of members of different types do not interfere, each type will give an
independent Poisson distribution and the expected number may be
written $m_r \pm \sqrt{m_r}$. If the observed numbers are $n_r$ we have therefore

$$\chi^2 = \sum\frac{(n_r-m_r)^2}{m_r}$$

taken over all types. The degrees of freedom will be $p$. Such a case
might be realized if we were observing a phenomenon for a finite time,
so that the total number of events was subject to a sampling variation,
besides the separate variations of the numbers of the types.

† Phil. Mag. 50, 1900, 157-75.

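2.74. But if we are extracting from a population a sample of given
size, the total number of the sample is known as $N = \sum m_r$. If the
expectations are assessed in given ratios, but now are subject to the
total of the $n_r$ being $N$, we have introduced a linear constraint and
the number of degrees of freedom will be $p-1$. A detailed treatment,
following Pearson, is as follows. We return to the multinomial rule.
If $N$ is prescribed, and subject to $N$ the expectations are $m_r$, the
probability of a sample $n_1, \dots, n_p$ is

$$P(n_1\dots n_p \mid NH) = \frac{N!}{\prod n_r!}\prod\left(\frac{m_r}{N}\right)^{n_r}. \qquad (1)$$

Put

$$n_r = m_r + \alpha_r, \qquad (2)$$

where $\sum\alpha_r = 0$. Then

$$\log\prod(n_r!) = \tfrac12p\log2\pi - \sum n_r + \sum(n_r+\tfrac12)\log n_r, \qquad (3)$$

$$\log N! = \tfrac12\log2\pi - N + (N+\tfrac12)\log N, \qquad (4)$$

$$P(n_1\dots n_p \mid NH) = \frac{N^{1/2}}{(2\pi)^{\frac12(p-1)}\prod n_r^{1/2}}\prod\left(\frac{m_r}{n_r}\right)^{n_r}, \qquad (5)$$

which gives, on approximating to order $\alpha_r^2$,

$$P(n_1\dots n_p \mid NH) = \frac{N^{1/2}}{(2\pi)^{\frac12(p-1)}\prod m_r^{1/2}}\exp\left(-\frac12\sum\frac{\alpha_r^2}{m_r}\right) \qquad (6)$$

and

$$\chi^2 = \sum\frac{\alpha_r^2}{m_r} = \sum\frac{(n_r-m_r)^2}{m_r}. \qquad (7)$$

The probability distribution of $\chi$, given the $m_r$, is now to be found by
integration. But only $p-1$ of the $n_r$ can be varied independently, and
the result will be

$$P(d\chi \mid m_1\dots m_pH) \propto \chi^{p-2}\exp(-\tfrac12\chi^2)\,d\chi. \qquad (8)$$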
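
A minimal computation of $\chi^2 = \sum (n_r - m_r)^2/m_r$ for a sample of fixed size is sketched below. The observed counts and the assumed ratios are invented for illustration and are not data from the text; with the total prescribed, the degrees of freedom are one less than the number of types.

```python
def pearson_chi2(observed, expected):
    """Sum of (n_r - m_r)**2 / m_r over all types."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((n - m) ** 2 / m for n, m in zip(observed, expected))

observed = [18, 25, 17]                       # hypothetical sample, total N = 60
ratios = [1, 1, 1]                            # hypothetical ratios for the expectations
total = sum(observed)
expected = [total * w / sum(ratios) for w in ratios]

chi2 = pearson_chi2(observed, expected)
degrees_of_freedom = len(observed) - 1        # the fixed total removes one
print(chi2, degrees_of_freedom)
```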

2.75. If the analysis refers to a rectangular contingency table and
we wish to test whether the elements agree with the hypothesis that the
chances in different rows are in proportion, further degrees of freedom
disappear. For in such a case the ratios of the total chances in the
rows or in the columns are not fixed initially and must be estimated
from the data. Thus if there are $m$ rows and $n$ columns, we fix $m-1$
parameters from the numbers in the rows and $n-1$ from the columns.
The expectations being made in proportion, consistently with the row
and column totals, the number of degrees of freedom that remain in
$\chi^2$ is

$$mn-1-(m-1)-(n-1) = (m-1)(n-1).$$

If $m = n = 2$ the number therefore reduces to 1.

2.76. The $\chi^2$ analysis is of enormous use. It is easy to apply, and
very often is enough to answer the question asked. This means really
that the hypothesis stated is very often right and the predictions made
by it come off. It does not, however, always go into sufficient detail.
More will be said about this under significance tests. The trouble is that
it combines all degrees of freedom together as if they were all relevant
to the same question, whereas only part of the information in them may
be relevant. If, for instance, we have a set of data with 32 degrees of
freedom, the expected $\chi^2$ on the hypothesis of complete randomness
will be $32 \pm 8$, which means that in the ordinary course of events it may
be anything from 24 to 40 and might go beyond this range without anything
but random error being involved. If there is actually a systematic
variation whose amount is four times its standard error, it will contribute
16 to $\chi^2$; but if the other degrees of freedom happen to contribute
only 24 the total will still be 40, which would pass as entirely random.
But a systematic variation of 4 times its standard error would be
accepted as genuine by any significance test if it was tested directly.
The trouble is that with regard to a large number of data we may want
to ask several questions. To some of them the answer will be 'yes', to
others 'no'. But if we try to sum up all the information in one number
we shall not know what question we have answered. It is desirable to
arrange the work, when several questions arise simultaneously, so as to
provide answers to each of them separately. When this is done it is still
found that the $\chi^2$ form persists, but it is now broken up into separate
parts each of which has its own message.
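
The arithmetic of the 32-degree example can be laid out explicitly. The short sketch below only restates the numbers in the paragraph above; the function and variable names are hypothetical.

```python
from math import sqrt

def chi2_summary(degrees_of_freedom):
    """Centre and rough spread of chi^2 under pure randomness: n and sqrt(2n)."""
    return degrees_of_freedom, sqrt(2 * degrees_of_freedom)

centre, spread = chi2_summary(32)
systematic = 4.0 ** 2                 # a variation of 4 standard errors adds 16
remainder = 24.0                      # what the other degrees of freedom happen to add
total = systematic + remainder

# The total sits at the edge of the ordinary range even though one
# component, tested by itself, is four times its standard error.
print(centre, spread, total, total <= centre + spread)
```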

The passage from (5) to (6) above involves the neglect of cubic terms. 
In Pearson’s earlier work he ignored the resulting errors, sometimes 
applying the result when the expectation was considerably less than 1. 
Later he recommended grouping the small expectations together so
that the expectation in no group would be less than 5. This has the
disadvantage that in, for instance, a test of the normal law of errors, 





an observation in a range where there might be a 0.001 chance that
any would occur on the normal law, and taken by itself would be strong
evidence against the law, cannot be considered except in combination
with several others, and there is considerable loss in sensitiveness. Both
methods have drawbacks in dealing with small groups, but where the
expectations are over 1 the earlier method seems to be the better;
where they are under 1 the only solution seems to be to introduce a new
parameter explicitly and estimate it. Then the relevant part of $\chi^2$ is
the square of the ratio of the new parameter to its standard error.

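2.8. The t and z distributions. Suppose that we have $n$ observations
derived from the normal law with true value $x$ and standard error $\sigma$.
Their joint chance is

$$P(dx_1 \ldots dx_n \mid x, \sigma, H) = \frac{1}{(2\pi)^{n/2}\sigma^n} \exp\left\{-\frac{1}{2\sigma^2}\sum (x_r - x)^2\right\} dx_1 \ldots dx_n. \qquad (1)$$

Put

$$n\bar{x} = \sum x_r; \qquad (n-1)s^2 = ns'^2 = \sum (x_r - \bar{x})^2. \qquad (2)$$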

Then $\bar{x}$ is the arithmetic mean and $s$ is the standard deviation as usually
defined. We shall call $s'$ the mean square deviation. $\bar{x}$, $s$, and $s'$ are
all determinate functions of the observed values. In the present problem
writing is simplified by using $s'$ rather than $s$, but when we come to the
method of least squares we shall find that $s$ has advantages. Also

$$\sum (x_r - x)^2 = \sum \{(x_r - \bar{x}) + (\bar{x} - x)\}^2 = \sum (x_r - \bar{x})^2 + 2(\bar{x} - x)\sum (x_r - \bar{x}) + n(\bar{x} - x)^2. \qquad (3)$$

The second term vanishes by the definition of $\bar{x}$, and the result is

$$ns'^2 + n(\bar{x} - x)^2.$$

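Hence

$$P(dx_1 \ldots dx_n \mid x, \sigma, H) = \frac{1}{(2\pi)^{n/2}\sigma^n} \exp\left[-\frac{n}{2\sigma^2}\{(\bar{x} - x)^2 + s'^2\}\right] dx_1 \ldots dx_n. \qquad (4)$$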

Thus $\bar{x}$ and $s$ or $s'$ are what Fisher calls sufficient statistics. A ‘statistic’
in his terminology is any function of the observations that we might 
choose to provide an estimate of an unknown parameter in a law. 
We have seen that, whatever the prior probability may be, the observa- 
tions enter into the posterior probability only through the likelihood, 
which in this case is the function we have just given. Also in practice 
the observations are not exact determinations since we read only to 
the nearest convenient multiple of some convenient unit. A reading of 
15.3 mm. means really an observation between 15.25 and 15.35 mm.
This range of 0.1 mm. would replace $dx_r$ in practice, and it is the same
whatever the parameters in the law. Hence, when we apply the principle
of inverse probability, the factor $dx_1 \ldots dx_n$ is the same for all values of
the unknowns $x$ and $\sigma$, and will cancel. It follows that the whole of the
information with respect to $x$ and $\sigma$ that is contained in the observations
is summarized in the two statistics $\bar{x}$ and $s$. When this occurs it
is unnecessary to make further reference to the observations apart from 
these statistics, which are therefore called sufficient. A definition of a 
sufficient statistic is as follows. Whenever the likelihood, apart from 
factors independent of the unknown parameters to be estimated, can 
be expressed as a function of the unknown parameters, the number
of observations, and a number of functions of the observations equal 
to the number of unknown parameters, those functions of the observa- 
tions are called sufficient statistics. 
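
The meaning of sufficiency here can be illustrated numerically. The sketch below is an illustration under stated assumptions, not part of the original argument: the two samples and the function names are invented for the purpose, and the samples are constructed to share the number of observations, the mean and the mean square deviation. The joint chance (with the factors $dx_r$ omitted) is then the same for both, whatever trial values of the true value and standard error are used.

```python
import math

def mean_and_s_prime(xs):
    """Arithmetic mean and mean square deviation s' (divisor n) of a sample."""
    n = len(xs)
    mean = sum(xs) / n
    s_prime = math.sqrt(sum((xr - mean) ** 2 for xr in xs) / n)
    return mean, s_prime

def log_chance_direct(xs, x_true, sigma):
    """Log of the joint normal chance of the observations, dx factors omitted."""
    n = len(xs)
    ss = sum((xr - x_true) ** 2 for xr in xs)
    return -0.5 * n * math.log(2 * math.pi) - n * math.log(sigma) - ss / (2 * sigma ** 2)

def log_chance_from_statistics(n, mean, s_prime, x_true, sigma):
    """The same log chance computed from n, the mean and s' alone."""
    q = n * ((mean - x_true) ** 2 + s_prime ** 2)
    return -0.5 * n * math.log(2 * math.pi) - n * math.log(sigma) - q / (2 * sigma ** 2)

# Two different (hypothetical) samples sharing n = 5, mean 3 and the same s'.
sample_a = [1.0, 2.0, 3.0, 4.0, 5.0]
r = math.sqrt(2.5)
sample_b = [3.0 - r, 3.0 - r, 3.0, 3.0 + r, 3.0 + r]

for x_true, sigma in [(2.0, 1.0), (3.5, 2.0), (0.0, 0.7)]:
    a = log_chance_direct(sample_a, x_true, sigma)
    b = log_chance_direct(sample_b, x_true, sigma)
    m, sp = mean_and_s_prime(sample_a)
    c = log_chance_from_statistics(len(sample_a), m, sp, x_true, sigma)
    print(x_true, sigma, a, b, c)
```

The three columns agree for every trial pair, so the individual observations add nothing once the two statistics are known.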

For various purposes we require to know the joint probability distribution
of $\bar{x}$ and $s'$, given $x$ and $\sigma$. Then we must consider a pair of ranges
of $\bar{x}$ and $s'$ and form the integral of (4) over all values of the observable
values $x_r$ that give $\bar{x}$ and $s'$ in these ranges. This is easily done as follows,
by translating into analytic language a geometrical argument due to
Fisher. We can regard $x_r$ as a set of rectangular coordinates of a point
in $n$-dimensional space, and then $\sum (x_r - x)^2$ is the square of the distance
of this point from a point all of whose coordinates are $x$. But we can
rotate the axes in any way, and this will still hold for the new axes.
In analytic language, we can form $n$ linear functions of the $x_r$ such that
distances are preserved; a new function is

$$x_i' = \sum_r a_{ir} x_r,$$

where

$$\sum_r a_{ir}^2 = 1; \qquad \sum_r a_{ir} a_{jr} = 0 \quad (i \neq j),$$

and this can be done in an infinity of ways. We can choose one of the
$x_i'$ to be

$$x_1' = \sum_r x_r / \sqrt{n} = \bar{x}\sqrt{n}. \qquad (7)$$

Applying this to the point $(x, x, x, \ldots)$ gives $(x\sqrt{n}, 0, 0, \ldots)$.

Then the integral of (4) is

$$\frac{1}{(2\pi)^{n/2}\sigma^n} \int\!\cdots\!\int \exp\left[-\frac{1}{2\sigma^2}\left\{(x_1' - x\sqrt{n})^2 + {\sum}' x_i'^2\right\}\right] dx_1' \ldots dx_n' \qquad (8)$$

through any region, where $\sum'$ denotes summation for all $i$ except $i = 1$.

$${\sum}' x_i'^2 = \sum (x_r - \bar{x})^2 = ns'^2. \qquad (9)$$





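Hence if we consider a region between two fixed values of $x_1'$ and two
fixed values of $s'$, the integral breaks up into two factors

$$I_1 = \int \exp\left\{-\frac{1}{2\sigma^2}(x_1' - x\sqrt{n})^2\right\} dx_1', \qquad (10)$$

$$I_2 = \int\!\cdots\!\int \exp\left\{-\frac{1}{2\sigma^2}{\sum}' x_i'^2\right\} dx_2' \ldots dx_n', \qquad (11)$$

integration in the latter case being over all values such that

$$ns'^2 < {\sum}' x_i'^2 < n(s' + ds')^2. \qquad (12)$$

Within short ranges of $x_1'$ and $s'$, therefore, the integral

$$\propto \exp\left\{-\frac{1}{2\sigma^2}(x_1' - x\sqrt{n})^2\right\} dx_1' \cdot s'^{\,n-2} \exp\left(-\frac{ns'^2}{2\sigma^2}\right) ds' \qquad (13)$$

$$\propto \exp\left\{-\frac{n}{2\sigma^2}(\bar{x} - x)^2\right\} d\bar{x} \cdot s'^{\,n-2} \exp\left(-\frac{ns'^2}{2\sigma^2}\right) ds'. \qquad (14)$$

The constant factor is determined by the condition that $\bar{x}$ is certain to
be between $\pm\infty$ and $s'$ between 0 and $\infty$. Hence

$$P(d\bar{x}\,ds' \mid x, \sigma, H) = \frac{n^{n/2}}{2^{\frac{1}{2}n-1}\sqrt{\pi}\,(\frac{1}{2}n - \frac{3}{2})!\,\sigma^n}\, s'^{\,n-2} \exp\left[-\frac{n}{2\sigma^2}\{(\bar{x} - x)^2 + s'^2\}\right] d\bar{x}\,ds'. \qquad (15)$$

The argument fails if $n = 1$, for then $s'$ is necessarily 0 and the factor
$I_2$ does not arise.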

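Now put

$$\bar{x} - x = s'z \qquad (16)$$

and transform to variables $s'$ and $z$. We have now

$$P(dz\,ds' \mid x, \sigma, H) = \frac{n^{n/2}}{2^{\frac{1}{2}n-1}\sqrt{\pi}\,(\frac{1}{2}n - \frac{3}{2})!\,\sigma^n}\, s'^{\,n-1} \exp\left\{-\frac{ns'^2}{2\sigma^2}(1 + z^2)\right\} ds'\,dz, \qquad (17)$$

and finally, on integrating with regard to $s'$,

$$P(dz \mid x, \sigma, H) = \frac{(\frac{1}{2}n - 1)!}{\sqrt{\pi}\,(\frac{1}{2}n - \frac{3}{2})!} \frac{dz}{(1 + z^2)^{\frac{1}{2}n}}. \qquad (18)$$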

This rule was first obtained by W. S. Gosset, a prominent statistical
writer who used the nom de plume of ‘Student’.† Its remarkable feature
is that it is independent of $x$ and $\sigma$, which may therefore be suppressed;
their actual values are irrelevant to $z$, and their existence is implied
by $H$, which includes the statement that the normal law holds in the
problem under discussion. It may be transformed by introducing the
quantities

$$s_{\bar{x}} = \frac{s'}{\sqrt{n-1}} = \frac{s}{\sqrt{n}} = \left\{\frac{\sum (x_r - \bar{x})^2}{n(n-1)}\right\}^{1/2}, \qquad (19)$$

$$t = (n-1)^{1/2} z. \qquad (20)$$

† Biometrika, 6, 1908, 1-25.

Then $s_{\bar{x}}$ is the usual conventional estimate of the standard error of a
mean† and $t$ is the ratio of the actual error of the mean to the estimated
standard error. We shall then have

$$P(dt \mid x, \sigma, H) = \frac{(\frac{1}{2}n - 1)!}{\sqrt{(n-1)\pi}\,(\frac{1}{2}n - \frac{3}{2})!} \left(1 + \frac{t^2}{n-1}\right)^{-\frac{1}{2}n} dt. \qquad (21)$$

This is now the usually adopted form, and is called the t distribution.
If $n$ is large it tends to the normal with standard error 1, but for
moderate values of $n$ it is more widely spread to large values of $t$. This
represents the fact that, given $x$ and $\sigma$, the probabilities of different
values of $\bar{x}$ and $s'$ are independent. Consequently, while those of $\bar{x}$
follow the normal law with standard error $\sigma/\sqrt{n}$, in any individual case
the error of $\bar{x}$ may be associated with a value of $s'$ either more or less
than $\sigma$, and $s_{\bar{x}}$ as calculated from $s$ may be either more or less than
$\sigma/\sqrt{n}$. The result is that there is a considerable chance that an error of
$\bar{x}$ larger than $\sigma/\sqrt{n}$ will be associated with a value of $s_{\bar{x}}$ less than $\sigma/\sqrt{n}$,
and the result will be to give an excess chance of large values of $t$ in
comparison with that for $x/\sigma$ on the normal law.
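
The excess chance of large values of $t$ can be seen numerically. The sketch below assumes the standard form of the $t$ density on $n - 1$ degrees of freedom, with the factorials written as gamma functions, and integrates the tail by the midpoint rule; the function names, the truncation point and the step count are choices made here for illustration.

```python
import math

def t_density(t, n):
    """Density of t for n observations, i.e. n - 1 degrees of freedom."""
    nu = n - 1
    log_const = math.lgamma(n / 2) - math.lgamma(nu / 2) - 0.5 * math.log(nu * math.pi)
    return math.exp(log_const - (n / 2) * math.log(1 + t * t / nu))

def normal_density(t):
    """Normal density with standard error 1."""
    return math.exp(-0.5 * t * t) / math.sqrt(2 * math.pi)

def tail_beyond(density, k, upper=100.0, steps=50_000):
    """Two-sided chance of |t| > k by the midpoint rule, truncated at `upper`."""
    h = (upper - k) / steps
    return 2 * h * sum(density(k + (i + 0.5) * h) for i in range(steps))

normal_tail = tail_beyond(normal_density, 3.0)
print("normal", normal_tail)
for n in (3, 5, 10, 30):
    print(n, tail_beyond(lambda t: t_density(t, n), 3.0))
```

For small $n$ the chance of $|t| > 3$ is many times the normal value, and it falls towards the normal value as $n$ increases.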

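2.81. Suppose now that we have two separate samples of $n_1$ and $n_2$
observations derived from a normal law with the same parameters, and that
their means and mean square deviations are $\bar{x}_1$ and $\bar{x}_2$, $s_1'$ and $s_2'$. What is the
joint chance of these four quantities lying in prescribed ranges, given
$x$ and $\sigma$? Since the law is one of chance neither set can give any information
about the other when $x$ and $\sigma$ are given; hence by the product rule

$$\begin{aligned}
P(d\bar{x}_1\,d\bar{x}_2\,ds_1'\,ds_2' \mid x, \sigma, H) = {} & \frac{\sqrt{n_1 n_2}}{2\pi\sigma^2} \exp\left\{-\frac{n_1}{2\sigma^2}(\bar{x}_1 - x)^2\right\} d\bar{x}_1 \exp\left\{-\frac{n_2}{2\sigma^2}(\bar{x}_2 - x)^2\right\} d\bar{x}_2 \\
& \times \frac{n_1^{\frac{1}{2}(n_1-1)}\, s_1'^{\,n_1-2}}{2^{\frac{1}{2}(n_1-3)}(\frac{1}{2}n_1 - \frac{3}{2})!\,\sigma^{n_1-1}} \exp\left(-\frac{n_1 s_1'^2}{2\sigma^2}\right) ds_1' \\
& \times \frac{n_2^{\frac{1}{2}(n_2-1)}\, s_2'^{\,n_2-2}}{2^{\frac{1}{2}(n_2-3)}(\frac{1}{2}n_2 - \frac{3}{2})!\,\sigma^{n_2-1}} \exp\left(-\frac{n_2 s_2'^2}{2\sigma^2}\right) ds_2',
\end{aligned} \qquad (22)$$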


† It has no unique standard error since the posterior probability of the true value,
given the mean and standard deviation, is not normally distributed. 





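which is the product of four independent factors. Now consider the
chance that $s_1'$ will lie between $s_2' y$ and $s_2'(y + dy)$. For all values of
$\bar{x}_1$ and $\bar{x}_2$ we have, by Theorem 12,

$$P(dy\,ds_2' \mid \bar{x}_1, \bar{x}_2, x, \sigma, H) = \frac{n_1^{\frac{1}{2}(n_1-1)} n_2^{\frac{1}{2}(n_2-1)}}{2^{\frac{1}{2}(n_1+n_2-6)}(\frac{1}{2}n_1 - \frac{3}{2})!\,(\frac{1}{2}n_2 - \frac{3}{2})!\,\sigma^{n_1+n_2-2}}\, y^{n_1-2}\, s_2'^{\,n_1+n_2-3} \exp\left\{-\frac{n_2 + n_1 y^2}{2\sigma^2}\, s_2'^2\right\} dy\,ds_2' \qquad (23)$$

and, integrating with regard to $s_2'$,

$$P(dy \mid \bar{x}_1, \bar{x}_2, x, \sigma, H) = \frac{2\, n_1^{\frac{1}{2}(n_1-1)} n_2^{\frac{1}{2}(n_2-1)} (\frac{1}{2}n_1 + \frac{1}{2}n_2 - 2)!}{(\frac{1}{2}n_1 - \frac{3}{2})!\,(\frac{1}{2}n_2 - \frac{3}{2})!} \frac{y^{n_1-2}}{(n_2 + n_1 y^2)^{\frac{1}{2}(n_1+n_2-2)}}\, dy. \qquad (24)$$

Now put $y = e^z$.

$$P(dz \mid \bar{x}_1, \bar{x}_2, x, \sigma, H) = \frac{2\, n_1^{\frac{1}{2}(n_1-1)} n_2^{\frac{1}{2}(n_2-1)} (\frac{1}{2}n_1 + \frac{1}{2}n_2 - 2)!}{(\frac{1}{2}n_1 - \frac{3}{2})!\,(\frac{1}{2}n_2 - \frac{3}{2})!} \frac{e^{(n_1-1)z}}{(n_2 + n_1 e^{2z})^{\frac{1}{2}(n_1+n_2-2)}}\, dz. \qquad (25)$$

This, with a change of variable, is Fisher’s z distribution.† If we take
$\nu_1 = n_1 - 1$, $\nu_2 = n_2 - 1$ (following Yule and Kendall in this notation),
we have $\nu_1 s_1^2 = n_1 s_1'^2$, $\nu_2 s_2^2 = n_2 s_2'^2$, $\log(s_1/s_2) = z$,

$$P(dz \mid \bar{x}_1, \bar{x}_2, x, \sigma, H) = \frac{2\, \nu_1^{\frac{1}{2}\nu_1} \nu_2^{\frac{1}{2}\nu_2} (\frac{1}{2}\nu_1 + \frac{1}{2}\nu_2 - 1)!}{(\frac{1}{2}\nu_1 - 1)!\,(\frac{1}{2}\nu_2 - 1)!} \frac{e^{\nu_1 z}\, dz}{(\nu_2 + \nu_1 e^{2z})^{\frac{1}{2}(\nu_1+\nu_2)}}. \qquad (26)$$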

This is Fisher’s form. It is curious that the factors that arise in the
transformation should cancel so completely. In practice there is an
arbitrariness as to which of the standard deviations we should call $s_1$;
the larger is taken, so that $z$ in actual use is always positive. It is easy
to verify that interchanging $\nu_1$ and $\nu_2$ and reversing the sign of $z$ leaves
(26) unaltered. But apart from this conventional restriction $z$ can range
from $-\infty$ to $+\infty$, unlike $y$, which can only range from 0 to $\infty$, and the
law for $z$ is therefore much more symmetrical. The law is in fact nearly
normal for moderate departures of z from 0, and may be conveniently 
represented by . , , , , ^ w /2 

Detailed tables of the values of $z$ with 5, 1 and 0.1 per cent. chances of
being exceeded on the hypothesis of random variation are given by
Fisher.‡
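
The symmetry of Fisher's form under interchange of the two numbers of degrees of freedom with reversal of the sign of $z$, and its normalization, can be checked numerically. The sketch below assumes the usual form of the $z$ density, with each factorial $(\frac{1}{2}k - 1)!$ written as a gamma function of $\frac{1}{2}k$; the function names, integration limits and step count are illustrative choices.

```python
import math

def z_density(z, nu1, nu2):
    """Fisher's z density on (nu1, nu2) degrees of freedom."""
    log_const = (math.log(2.0)
                 + 0.5 * nu1 * math.log(nu1) + 0.5 * nu2 * math.log(nu2)
                 + math.lgamma(0.5 * (nu1 + nu2))
                 - math.lgamma(0.5 * nu1) - math.lgamma(0.5 * nu2))
    log_kernel = nu1 * z - 0.5 * (nu1 + nu2) * math.log(nu2 + nu1 * math.exp(2 * z))
    return math.exp(log_const + log_kernel)

def total_chance(nu1, nu2, lower=-10.0, upper=10.0, steps=20_000):
    """Midpoint-rule integral of the density over a wide range of z."""
    h = (upper - lower) / steps
    return h * sum(z_density(lower + (i + 0.5) * h, nu1, nu2) for i in range(steps))

for z in (0.1, 0.5, 1.0):
    # interchange the degrees of freedom and reverse the sign of z
    print(z, z_density(z, 4, 9), z_density(-z, 9, 4))

print(total_chance(4, 9))
```

The paired values agree, and the density integrates to unity over the whole range of $z$.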

† Proc. Roy. Soc. A, 121, 1928, 669.
‡ Statistical Methods for Research Workers, Table VI.

2.82. The $z$ rule may be regarded as a generalization of $\chi^2$. The $\chi^2$
rule assumes that the data are either derived from the normal law with
known standard errors, or approximately so derived with standard
errors calculable from frequencies, and the probable scatter of the data
is compared with the known standard errors. In the $z$ rule, the scatter
of one set of estimates is compared with that of another set, each being
measured by the standard deviation and not by the standard error,
and consequently both numbers of degrees of freedom appear in the
result. But it is supposed that each estimate of either set has the same
standard error. This is achieved in biological experiments by what is
called a balanced design (cf. 4.9). In physics it is hardly ever achieved;
the essence of comparison of physical estimates is usually that they
have been obtained by different methods and consequently have
different standard errors. We therefore need a method to replace the
$z$ rule in such conditions; we can hope only for an approximate answer,
but some answer is necessary.

If we have several series of estimates $x_r$ with estimated standard
errors $s_r$ based on $\nu_r$ d.f., we might suggest forming the sum

$$\sum_r \frac{x_r^2}{s_r^2} \qquad (1)$$

for the series together, measuring each $x_r$ from a weighted mean of the
$x_r$. This is the simplest analogue of $\chi^2$. When all the $\nu_r$ are large it is
fairly satisfactory. If there are $n$ estimates the number of degrees of
freedom is $n - 1$. But if the $\nu_r$ are not large this function will not follow
the same rule as $\chi^2$. The expectation of a term of (1) is not 1 but
$\nu_r/(\nu_r - 2)$ for $\nu_r > 2$; for $\nu_r \le 2$ it is infinite. Consequently, if we
estimate $\chi^2$ using the estimated standard errors, the estimate will be about

$$\sum_r \frac{\nu_r}{\nu_r - 2} - 1 \qquad (2)$$

instead of $n - 1$. This may be serious. Suppose that we have 10 series
of 5 observations each, and form $\chi^2$ in this way from the means. The
expectation of $\chi^2$ will be 19 instead of 9. But on 9 d.f. $\chi^2 = 19$ is nearly
up to the 2 per cent. point, and such a set of means will habitually be
judged discordant even if the variation is wholly random.
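
The size of the inflation in this example follows from the expectation of $t^2$ on $\nu$ degrees of freedom, which is $\nu/(\nu - 2)$ for $\nu > 2$. The sketch below works the case of 10 series of 5 observations exactly; the names are illustrative, and reading the text's figure of 19 as this total less one for the fitted weighted mean is an interpretation assumed here.

```python
from fractions import Fraction

def expected_t_squared(nu):
    """Expectation of t^2 on nu degrees of freedom; finite only for nu > 2."""
    if nu <= 2:
        raise ValueError("expectation of t^2 is infinite for nu <= 2")
    return Fraction(nu, nu - 2)

series = 10
observations_per_series = 5
nu = observations_per_series - 1           # 4 d.f. behind each standard error
per_term = expected_t_squared(nu)          # expectation of each squared ratio
total = series * per_term                  # before allowing for the fitted mean
allowing_for_mean = total - 1              # assumed allowance of one for the mean
nominal = series - 1                       # what the chi-squared rule would expect

print(per_term, total, allowing_for_mean, nominal)
```

Each squared ratio is expected to be twice as large as it would be with known standard errors, so that the sum is about double its nominal expectation.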

A better method is suggested by the central limit theorem. We have,
if $E$ denotes an expectation,

$$E(t^2 - Et^2)^2 = \frac{2\nu^2(\nu - 1)}{(\nu - 2)^2(\nu - 4)}, \qquad (3)$$

and if

$$t'^2 = \frac{\nu - 2}{\nu} \sqrt{\frac{\nu - 4}{\nu - 1}}\; t^2, \qquad (4)$$

the expectation of $(t'^2 - Et'^2)^2$ is always 2 for $\nu > 4$, and

$$\sum t_r'^2 \qquad (5)$$

will have a nearly normal probability distribution for $n$ more than
about 3 or 4. Then

$$\sum t_r'^2 = \sum_r \sqrt{\frac{\nu_r - 4}{\nu_r - 1}} \pm \sqrt{2n} \qquad (6)$$

if the true value is taken as 0. If one weighted mean is determined it
will be allowed for approximately by multiplying the first term by
$(n - 1)/n$ and replacing $\sqrt{2n}$ by $\sqrt{2n - 2}$ in the second. It now becomes
impossible to include in the test any estimates based on fewer than 4
d.f., but if those with $\nu_r > 4$ are found accordant they can be combined,
and then those with $\nu_r \le 4$ can be compared with them
individually.
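
The scaling used in this approximate test can be verified numerically. The sketch below assumes the standard variance of the square of a $t$ variate on $\nu$ degrees of freedom, $2\nu^2(\nu-1)/\{(\nu-2)^2(\nu-4)\}$, and a multiplier $\frac{\nu-2}{\nu}\sqrt{(\nu-4)/(\nu-1)}$ chosen to bring that variance to 2; the function names and the example of ten series with 9 d.f. each are illustrative assumptions.

```python
import math

def variance_of_t_squared(nu):
    """Variance of t^2 on nu degrees of freedom (finite only for nu > 4)."""
    if nu <= 4:
        raise ValueError("variance of t^2 is infinite for nu <= 4")
    return 2 * nu ** 2 * (nu - 1) / ((nu - 2) ** 2 * (nu - 4))

def scaling_factor(nu):
    """Multiplier c such that c * t^2 has variance 2."""
    return (nu - 2) / nu * math.sqrt((nu - 4) / (nu - 1))

def expectation_and_sd(nus, weighted_mean_fitted=False):
    """Expected sum of the scaled squares and its standard error."""
    n = len(nus)
    expectation = sum(math.sqrt((nu - 4) / (nu - 1)) for nu in nus)
    sd = math.sqrt(2 * n)
    if weighted_mean_fitted:
        # approximate allowance for one fitted weighted mean
        expectation *= (n - 1) / n
        sd = math.sqrt(2 * n - 2)
    return expectation, sd

for nu in (5, 8, 20, 100):
    c = scaling_factor(nu)
    # scaled variance, then scaled expectation c * nu / (nu - 2)
    print(nu, c ** 2 * variance_of_t_squared(nu), c * nu / (nu - 2))

print(expectation_and_sd([9] * 10, weighted_mean_fitted=True))
```

An observed sum of scaled squares can then be compared with the returned expectation in units of the returned standard error, in the same way as $\chi^2$ is compared with its expectation.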

The method is necessarily rough, but should serve as a useful compromise
capable of being used in the same way as $\chi^2$. Like $\chi^2$, it will
not always be the end of the matter, but will provide a simple way of
seeing whether it is worth while to go into greater detail.



III

ESTIMATION PROBLEMS 

‘We’ve got to stand on our heads, as men of intellect should.’ 

R. Austin Freeman, The Red Thumb Mark

3.0. In the problems of the last chapter we were considering the proba- 
bilities that various observable events would occur, given certain laws 
and the values of all parameters included in these laws. The usual use 
of these results is that they provide the likelihood for different values
of the parameters; then, taking the observed results as given, and using 
the principle of inverse probability, we can assess the relative probabili- 
ties of the different values of the parameters. A problem of estimation 
is one where we are given the form of the law, in which certain para- 
meters can be treated as unknown, no special consideration needing to 
be given to any particular values, and we want the probability
distributions of these parameters, given the observations.

Now from any finite number of observations we can never evaluate 
more than a certain number of parameters. A sample {l,m) cannot 
determine more than two parameters and, since l-\-m is in practice 
chosen for convenience and has no reference beyond the sample, there 
will be only one parameter that has any relevance beyond the sample 
itself. A set of n quantitative observations cannot determine more 
than n adjustable parameters; but if we always admitted the full n we 
should be back at our original position, since a new parameter would 
imply a new function, and we should change our law with every observa- 
tion. Thus the principle that laws have some validity beyond the 
original data would be abandoned. It is necessary, therefore, to the 
statement of a scientific law that it involves a number of adjustable para- 
meters (possibly none) and that new observations do not alter the form of 
the law, though they may alter the estimates of the parameters. The
likelihood of a given set of observations has no definite value unless the form
of the law is given and all the parameters in the law are explicitly stated. 

On the other hand, a law is not a final statement. By rule 5 we can
rule out no law as impossible a priori, and if a true law involves n 
parameters it could not be found until there are more than n relevant 
observations. Hence the number of parameters in the laws that it is 
possible to consider at any time depends on the number of observations. 
Thus it is a necessity of progress that laws must be considered, on the 
whole, in the order of increasing number of adjustable parameters. 





The function of significance tests is to provide a way of arriving, in 
suitable cases, at a decision that at least one new parameter is needed 
to give an adequate representation of the existing data and valid 
inferences to future ones. But we must not deny in advance that those 
already considered are adequate, the outstanding variation being legiti- 
mately treated as random. Though we do not claim that our laws are 
necessarily final statements, we claim that they may be, and that on 
sufficient evidence they have high probabilities. But by rule 5 we can
set no limit to the number of possible laws, and this is the same as saying 
that the number is infinite. If all laws had the same prior probability 
it would be infinitesimal, and would remain infinitesimal on any amount 
of evidence. Thus there could be no stop, nor even a temporary pause, 
unless we agree that every law has a finite prior probability. But then 
if there are an infinite number of possible laws their prior probabilities
must form a convergent series. 
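
The contrast between equal prior probabilities and a convergent series can be made concrete. The sketch below is an illustration only: the geometric series $2^{-(m+1)}$ attached to the law with $m$ adjustable parameters is an assumed example of a convergent assignment, not a rule stated in the text, and the function names are invented for the purpose.

```python
def uniform_prior(number_of_laws):
    """Equal prior probability for each of a finite number of laws."""
    return 1.0 / number_of_laws

def geometric_prior(m):
    """Assumed illustrative prior 2**-(m + 1) for the law with m parameters."""
    return 2.0 ** -(m + 1)

# With equal priors, each law's share shrinks without limit as laws are added.
for count in (10, 10**3, 10**6, 10**9):
    print(count, uniform_prior(count))

# With a convergent series, each law keeps a finite prior and the total is 1.
partial_sums = []
running = 0.0
for m in range(30):
    running += geometric_prior(m)
    partial_sums.append(running)
print(partial_sums[:5], partial_sums[-1])
```

Under the equal assignment no law retains an appreciable prior probability as the number of laws grows, whereas under the convergent assignment every law has a finite prior probability and the simplest laws carry the most.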

This result implies the possibility of arranging possible laws in an 
order of decreasing prior probability. What can this order be? The 
methods capable of being adopted, which are mainly those already in 
use, provide our answer. It is the order in which the laws ordinarily 
arise for consideration, that of increasing number of adjustable para- 
meters. This principle of convergence was what Wrinch and I originally 
called the simplicity postulate.† It is not, however, a separate postulate
but an immediate application of rule 5. We stated it in a way applicable 
only to quantitative laws expressed by differential equations, and in 
Scientific Inference I gave a quantitative definition of the complexity 
of a differential equation. This, however, appears insufficiently general,
because it is not clear that all laws are expressible by differential
equations; for instance, ‘all crows are black’, ‘the chance of throwing a head
with a penny is ½’, and the various non-commutative rules of quantum
theory. It appears much better not to restrict the possible types of law 
at all, but merely to be ready for them as they may arise for considera- 
tion, whatever their form. This makes the relation to actual thought 
immediate. The complexity of a law is now merely the number of ad- 
justable parameters in it, and this number is recognizable at once; we 
can satisfy rule 3. There is no objection to regarding such laws as $y \propto x$
and $y \propto x^2$ as of equal complexity, because their consequences will
usually differ so much that discrimination between them by means of 
observations will be easy; laws involving the same number of adjustable 
parameters can be taken as having the same prior probability. When 

† Phil. Mag. 42, 1921, 369-90.

the question of modifying a law first arises, the suggested modification 
must be stated, in most cases, in such a form that it involves one new 
parameter. (A modification to a law of different form, but involving 
the same number of parameters, can be tested directly. The more 
probable is the one with the higher likelihood.) The question will
then be: Is the new parameter supported by the observations, or is
any variation expressible by it better interpreted as random? Thus
we must set up two hypotheses for comparison, with equal prior pro- 
babilities, so as to say that we have no grounds for expecting it to be 
present or not. 

But if the parameters already introduced are $\alpha_1, \alpha_2, \ldots, \alpha_m$, and the
question is whether we should introduce another, $\alpha_{m+1}$, we can choose
it so that making it zero will reproduce the old law. This is equivalent,
therefore, to saying that we can proceed directly to the law containing
$\alpha_{m+1}$, but that if we do, half the prior probability is concentrated at
$\alpha_{m+1} = 0$. We shall see under significance tests how this procedure
leads to a test of whether the new parameter is supported by the 
evidence. At present we need only notice that a parameter that arises 
in a pure problem of estimation often presupposes a significance test 
that has disposed of some suggested value that it would have in a 
simpler law. A significance test itself, if it shows that a new parameter 
is needed, will lead to an estimate of it on the way. But there are many 
cases where tests have been applied in analogous cases, or where the 
evidence is so clear that a quantitative test of significance hardly needs 
to be applied. For instance, the latitude and longitude of the epicentre 
and the time of occurrence are obviously relevant parameters to the 
observations of an earthquake. In a problem of estimation, then, we 
proceed entirely on the hypothesis that the law is given and that the 
stated parameters and no others are needed. Their actual values are 
unknown and our object is to find estimates of them. Though estima- 
tion problems really presuppose the solution of the corresponding signifi- 
cance ones, it is convenient to take them first because they are easier 
mathematically and because in many cases the answer to the significance 
question is already known. 

3.1. Our first problem is to find a way of saying that the magnitude of 
a parameter is unknown, when none of the possible values need special 
attention. Two rules appear to cover the commonest cases. If the 
parameter may have any value in a finite range, or from $-\infty$ to $+\infty$,
its prior probability should be taken as uniformly distributed. If it 





arises in such a way that it may conceivably have any value from 0 to 
$\infty$, the prior probability of its logarithm should be taken as uniformly
distributed. There are cases of estimation where a law can be equally 
well expressed in terms of several different sets of parameters, and it is 
desirable to have a rule that will lead to the same results whichever set 
we choose. Otherwise we shall again be in danger of using different 
rules arbitrarily to suit our taste. It is now known that a rule with this 
property of invariance exists, and is capable of very wide, though not 
universal, application. 

The essential function of these rules is to provide a formal way of 
expressing ignorance of the value of the parameter over the range 
permitted. They make no statement of how frequently that parameter, 
or other analogous parameters, occur within different ranges. Their 
function is simply to give formal rules, as impersonal as possible, that 
will enable the theory to begin. Starting with any distribution of prior
probability and taking account of successive batches of data by the
principle of inverse probability, we shall in any case be able to develop
an account of the corresponding probability at any assigned state of
knowledge. There is no logical problem about the intermediate steps
that has not already been considered. But there is one at the beginning:
how can we assign the prior probability when we know nothing about
the value of the parameter, except the very vague knowledge just indi- 
cated ? The answer is really clear enough when it is recognized that a 
probability is merely a number associated with a degree of reasonable 
confidence and has no purpose except to give it a formal expression. If 
we have no information relevant to the actual value of a parameter, the 
probability must be chosen so as to express the fact that we have none. 
It must say nothing about the value of the parameter, except the bare 
fact that it may possibly, by its very nature, be restricted to lie within 
certain definite limits. 

The uniform distribution of the prior probability was used by Bayes
and Laplace in relation to problems of sampling, and by Laplace in some
problems of measurement. The problem in sampling would be, given
the total number in the population sampled, to use the sample to estimate
the numbers of different types in the population. We are prepared
for any composition if we know nothing about the population to start
with. Hence the rule must be such as to say that we know nothing
about it; and Bayes and Laplace did this by taking the prior probabilities
of all possible numbers in the population the same and leaving the
entire decision to the sample.




Bayes and Laplace, having got so far, unfortunately stopped there, 
and the weight of their authority seems to have led to the idea that the 
uniform distribution of the prior probability was a final statement for 
all problems whatever, and also that it was a necessary part of the 
principle of inverse probability. There is no more need for the latter 
idea than there is to say that an oven that has once cooked roast beef 
can never cook anything but roast beef. The fatal objection to the 
universal application of the uniform distribution is that it would make 
any significance test impossible. If a new parameter is being considered, 
the uniform distribution of prior probability for it would practically 
always lead to the result that the most probable value is different from 
zero — the exceptional case being that of a remarkable numerical coinci- 
dence. Thus any law expressed in terms of a finite number of parameters 
would always be rejected when the number of observations comes to 
be more than the number of parameters determined. In fact, however, 
the simple rule is retained and the new parameter rejected, at any rate 
until the latter exceeds a few times its standard error. I maintain that
the only ground that we can possibly have for not always rejecting the
simple law is that we believe that it is quite likely to be true; that is,
that when we have allowed for the variation accounted for by the
functions involved in it the rest of the variation is legitimately treated
as random, and that we shall get more accurate predictions by proceeding
in this way. We do not assert it as certain, but we do seriously
consider that it may be true — in other words, it has a non-zero prior 
probability, which is the prior probability that the new parameter, 
which is the coefficient of a new function, is zero. But that is a recogni- 
tion that for the purpose of significance tests, at least, the uniform 
distribution of the prior probability is invalid. 

The uniform distribution of the prior probability was applied to the 
standard error by Gauss, who, however, seems to have found something 
unsatisfactory about it. At any rate there is an obvious difficulty. If
we take

$$P(d\sigma \mid H) \propto d\sigma$$

as a statement that $\sigma$ may have any value between 0 and $\infty$, and want
to compare probabilities for finite ranges of $\sigma$, we must use $\infty$ instead
of 1 to denote certainty on data $H$. There is no difficulty in this
because the number assigned to certainty is conventional. It is usually
convenient to take 1, but there is nothing to say that it always is. But if
we take any finite value of $\sigma$, say $\alpha$, the number for the probability that
$\sigma < \alpha$ will be finite, and the number for $\sigma > \alpha$ will be infinite. Thus
the rule would say that whatever finite value $\alpha$ we may choose, if we
introduce Convention 3, the probability that $\sigma < \alpha$ is 0. This is
inconsistent with the statement that we know nothing about $\sigma$.

This is, I think, the essence of the difficulty about the uniform assess- 
ment in problems of estimation. It cannot be applied to a parameter 
with a semi-infinite range of possible values. Other objections that have 
been made at various times turn on the point that if a parameter is 
unknown then any power of it is unknown; but if such a parameter is $v$,
then if $v$ lies between $v_1$ and $v_1 + dv$, we should have according to the
rule

$$P(v_1 < v < v_1 + dv \mid H) \propto dv,$$

and if we try to apply the rule also to $v^n$ we should say also

$$P(v_1^n < v^n < (v_1 + dv)^n \mid H) \propto dv^n \propto v^{n-1}\,dv.$$

The propositions considered on the left are equivalent, but the assessments
on the right differ by the variable factor $v^{n-1}$. There are cases
where this problem has arisen. For instance, in the law connecting the
mass and volume of a substance it seems equally legitimate to express
it in terms of the density or the specific volume, which are reciprocals,
and if the uniform rule was adopted for one it would be wrong for the
other. Some methods of measuring the charge on an electron give $e$,
others $e^2$; but $de$ and $de^2$ are not proportional. In discussing errors of
measurement we do in fact usually represent them in terms of the
standard error; but there is no conclusive reason why we should not
use the precision constant $h = 1/(\sigma\sqrt{2})$, and $d\sigma$ is not proportional to $dh$.
But while many people had noticed this difficulty about the uniform 
assessment, they all appear to have thought that it was an essential 
part of the foundations laid by Laplace that it should be adopted in 
all cases whatever, regardless of the nature of the problem. The result 
has been to a very large extent that instead of trying to see whether 
there was any more satisfactory form of the prior probability, a succes- 
sion of authors have said that the prior probability is nonsense and 
therefore that the principle of inverse probability, which cannot work 
without it, is nonsense too. 

The way out is in fact very easy. If $v\rho$ is constant, then

$$\frac{dv}{v} + \frac{d\rho}{\rho} = 0.$$

If then $v$ is capable of any value from 0 to $\infty$, and we take its prior
probability distribution as proportional to $dv/v$, then $\rho$ is also capable
of any value from 0 to $\infty$, and if we take its prior probability as
proportional to $d\rho/\rho$ we have two perfectly consistent statements of the
same form. Similarly, for any other power, $dv/v$ and $dv^n/v^n$ are always
proportional, and the constant ratio will be absorbed in the adjustable
factor. If we have to express previous ignorance of the value of a
quantity over an infinite range, we have seen that to avoid dealing with
ratios of infinitesimals we shall have to represent certainty by infinity
instead of 1; thus the fact that $\int_0^\infty dv/v$ diverges at both limits is a
satisfactory feature. This argument is equally applicable if $v$ is restricted
to lie between values $v_1$, $v_2$, for

$$\frac{dv}{v \log(v_2/v_1)} = \frac{dv^n}{v^n \log(v_2^n/v_1^n)}.$$

This point is relevant to the fact that in many practical problems we 
are not totally ignorant of the standard error when we start. Some 
knowledge of it is implied by our choice of measuring instruments, 
which must be capable of reading to less than the standard error and 
must cover ranges greater than that likely to be covered by the observa- 
tions. Thus we usually have some vague knowledge initially that fixes 
upper and lower bounds to the standard error. But dv/v remains the 
only rule that is invariant for powers. If in an actual series of observations
the standard deviation is much more than the smallest admissible
value of $\sigma$, and much less than the largest, the truncation of the distribution
makes a negligible change in the results.
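
The invariance for powers can be checked with a truncated range. The sketch below is a minimal illustration; the bounds 0.1 and 10, the interval from 1 to 2, and the function names are assumptions chosen for the example. The same proposition is restated in terms of several powers of $v$, and its probability is computed under the $dv/v$ rule and under the uniform rule.

```python
import math

def chance_log_uniform(lo, hi, v1, v2):
    """Chance of lo < v < hi when the prior is dv/v truncated to (v1, v2)."""
    return math.log(hi / lo) / math.log(v2 / v1)

def chance_uniform(lo, hi, v1, v2):
    """Chance of lo < v < hi when the prior is uniform on (v1, v2)."""
    return (hi - lo) / (v2 - v1)

def restate(lo, hi, v1, v2, power):
    """Bounds of the same proposition when v is replaced by v**power."""
    new_lo, new_hi = sorted((lo ** power, hi ** power))
    new_v1, new_v2 = sorted((v1 ** power, v2 ** power))
    return new_lo, new_hi, new_v1, new_v2

V1, V2 = 0.1, 10.0     # assumed admissible range of v
LO, HI = 1.0, 2.0      # the proposition: v lies between 1 and 2
results = {}
for power in (1, 2, 3, -1):
    args = restate(LO, HI, V1, V2, power)
    results[power] = (chance_log_uniform(*args), chance_uniform(*args))
    print(power, results[power])
```

The first number in each pair is the same for every power, including the reciprocal, while the second changes with the parameter chosen.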

The point may be put in another way. If a parameter v is a dimensional 
magnitude and not a number, and we want to assess $P(dv \mid H)$, where
$H$ contains no information about $v$ except that it is positive, this can
only be of the form $Av^n\,dv$, where $A$ and $n$ are constants. For the ratio
of two probabilities must be a number, which would not be satisfied if
we took the first factor, say, as $\sin v$; the sine of a length means nothing.
Nor could it be, say, where a is some constant of the same dimen- 
sions as V. For then it would assign a definite value to the ratio of the 
probabilities that v is less or greater than a. If, then, a is known, it 
contradicts the condition that we know nothing about v except its 
existence and that it lies between 0 and -j-oo ; if it is not known we should 
have to provide a rule for estimating it or for saying that it is unknown, 
and in either case we are no further forward. The coefficient of dv must 
be something that involves no magnitude other than v, and if t? is 
dimensional this can be satisfied only by a power of v. But now if we 



ESTIMATION PROBLEMS 


Chap, ni 


consider some fixed value a, the ratio of the probabilities that v is less
or greater than a is

$$\int_0^a v^n\,dv \Big/ \int_a^\infty v^n\,dv.$$

If n > −1, the numerator is finite and the denominator infinite. We
could therefore introduce Convention 3 and say that the probability
that v is less than any finite value is 0. If n < −1, the numerator is
infinite and the denominator finite, and the rule would say that the
probability that v is greater than any finite value is 0. Both of these
would therefore be inconsistent with saying that we know nothing
about v. But if n = −1, both integrals diverge and the ratio is
indeterminate. We cannot now use Convention 3. Thus we attach no value
to the probability that v is greater or less than a, which is a statement
that we know nothing about v except that it is between 0 and ∞. Thus
the form

$$P(dv \mid H) \propto dv/v$$

is the only satisfactory one.
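
The argument can be checked numerically with a minimal sketch, under the assumption that the divergent integrals are replaced by integrals between finite cut-offs `eps` and `big` (both names, and the function `truncated_ratio`, are illustrative and not part of the text). For n > −1 the ratio tends to 0, for n < −1 it grows without limit, and for n = −1 it depends entirely on how the two cut-offs are taken to their limits.

```python
import math

def truncated_ratio(n, a, eps, big):
    """Ratio of the integral of v**n over (eps, a) to that over (a, big)."""
    def integral(lo, hi):
        if n == -1:
            return math.log(hi / lo)          # integral of dv/v
        return (hi ** (n + 1) - lo ** (n + 1)) / (n + 1)
    return integral(eps, a) / integral(a, big)

a = 2.0
for n in (0, -2, -1):
    for scale in (1e3, 1e6, 1e9):
        # symmetric cut-offs: eps = 1/scale, big = scale
        print(n, scale, truncated_ratio(n, a, 1 / scale, scale))

# For n = -1 the answer changes with the way the cut-offs are chosen.
print(truncated_ratio(-1, a, 1e-9, 1e3), truncated_ratio(-1, a, 1e-3, 1e9))
```

The last line gives two quite different values for the same a, which is the indeterminacy that the dv/v rule relies on.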

I have recently had an objection to it, that if we fix two possible
values a and b the rule will lead to the statement that the probability
that v lies between a and b is 0; and it is inferred from this that the rule
says that v is either 0 or ∞ and can have no finite value at all. To the
first point I should answer that if we know nothing about v except that
it may have any value over an infinite range we must in any case
regard it as a remarkable coincidence if it should be found in a particular
arbitrary finite range. If a and b are not arbitrary but are suggested
by some previous information, then v is not initially unknown and the
previous information should be allowed for. To the second point I
should say that what the rule says is that we attach ∞ as the number
to represent the total probability of all finite values; it says nothing at
all about the probability of an infinite or zero value. It is easy to invent
mathematical functions that are everywhere finite but whose integrals
diverge, such as

$$f(x) = 1/x \quad (x \neq 0), \qquad f(x) = 1 \quad (x = 0).$$

Fundamentally the fallacy in the argument is that it assumes the
converse of Theorem 2 in the type of case where zero probability does not
imply impossibility.

The rule seems to cover all dimensional magnitudes that might
conceivably have any value from 0 to ∞; and all cases where it appears







equally natural to take a quantity or some power of it as the parameter
to be estimated. The extension to all cases where we want to say that
a quantity is initially unknown except that it must lie between 0 and
∞ is done by rule 6, that we must introduce the minimum number of
independent postulates. If we used a different rule in other such cases
we should be making an unnecessary postulate.

If P(dv | H) ∝ dv/v, it is also proportional to d log v, and log v can have
any value from −∞ to +∞. The rule is therefore consistent with the
adoption of a uniform distribution for the prior probability of a
quantity restricted only to be real. It appears inconsistent at first sight with
the uniform assessment for a quantity with a finite range of possible
values. If such a quantity is x and must lie between 0 and 1, x/(1−x)
is a quantity restricted to lie between 0 and ∞; which suggests taking
a rule suggested by Haldane:

$$P(dx \mid H) \propto \frac{1-x}{x}\, d\!\left(\frac{x}{1-x}\right) \propto \frac{dx}{x(1-x)}.$$

Laplace's and Bayes's assessments in the sampling problem were simply
dx. Haldane's form gives infinite density at the limits. In spite of the
apparent inconsistency I think that the dv/v rule is right; there are
better grounds for believing that it says what it is meant to say (that
is, nothing) than for the Bayes-Laplace rule. I should not regard the
above as showing that dx/x(1−x) is right for their problem. Other
transformations would have the same properties and would be mutually
inconsistent if the same rule was taken for all.
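
The two transformation properties used here can be illustrated by a short numerical sketch. It assumes only the ordinary change-of-variable rule for densities; the helper names (`transformed_density`, `log_uniform_density`, `cube`, `odds`) are invented for the illustration. A power transformation of v leaves the dv/v form unchanged apart from a constant factor, while the substitution v = x/(1−x) gives a density proportional to 1/x(1−x).

```python
def log_uniform_density(v):
    """Unnormalised density of the dv/v rule."""
    return 1.0 / v

def transformed_density(density_v, v_of_x, x, h=1e-6):
    """Density in x implied by density_v on v = v_of_x(x), using |dv/dx|."""
    dv_dx = (v_of_x(x + h) - v_of_x(x - h)) / (2 * h)   # central difference
    return density_v(v_of_x(x)) * abs(dv_dx)

def cube(u):
    return u ** 3

def odds(x):
    return x / (1 - x)

# Power transformation v = u**3: the implied density is 3/u, still of the du/u form.
print(transformed_density(log_uniform_density, cube, 2.0), 3 / 2.0)

# v = x/(1-x): the implied density is 1/(x(1-x)), the form attributed to Haldane.
print(transformed_density(log_uniform_density, odds, 0.3), 1 / (0.3 * 0.7))
```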

I think that at this point we come up against one of the imperfections 
of the human mind that have given trouble in the theory: that it has 
an imperfect memory. If everything that attracted its attention was 
either remembered clearly or completely forgotten it would be much 
easier to make a formal theory correspond closely to what the mind 
actually does, and therefore there would be less need for one. Data 
completely forgotten would then be totally ignored, and we know how 
to do that; those perfectly remembered could be used in the theory in 
the usual way. But the mind retains great numbers of vague memories 
and inferences based on data that have themselves been forgotten, and 
it is impossible to bring them into a formal theory because they are not 
sufficiently clearly stated. In practice, if one of them leads to a sugges- 
tion of a problem as worth investigating, all that we can do is to treat 
the matter as if we were approaching it from ignorance — the vague 
memory is not treated as providing any information at all. If the com- 
ment on a competent piece of experimental work, leading to a definite 





conclusion, is ‘Everybody knew that’, the answer is, ‘Yes, but nobody
knew enough about it to convince anybody else.’ Now I am not at all 
sure that the difficulty about the Bayes-Laplace assessment is not of 
this kind. Is it a pure statement of ignorance, or has observational 
evidence, imperfectly catalogued, about the frequency of different 
sampling ratios in the past somehow got mixed with it ? Edgeworth and 
Pearson held that it was based on the observed fact that sampling ratios 
had been about uniformly distributed. This might appeal to a meteoro- 
logist studying correlations in weather, which do seem to be roughly 
uniformly distributed over the possible range, but hardly to a Mendelian. 
Again, is there not a preponderance at the extremes? Certainly if we take
the Bayes-Laplace rule right up to the extremes we are led to results that
do not correspond to anybody’s way of thinking. The rule dx/x(1−x)
goes too far the other way. It would lead to the conclusion that if a 
sample is of one type with respect to some property there is probability 1 
that the whole population is of that type. 

It is at least clear that some special hypothesis is needed for quanti- 
ties that must lie between 0 and 1, for even if we try to obtain a rule 
by transforming the dv/v rule the transformation is not unique. A
chance or a ratio in a population, if it is treated as unknown, is an 
adjustable parameter. Now our general considerations showed that an 
adjustable parameter usually presupposes a significance test that has 
excluded some suggested value. Is this so here ? It appears that it is. 
Naive notions of causality would make all population ratios either 0 or 
1. On our analysis such a suggestion would never be certain, but we 
must give it a finite prior probability at the outset. Not to do this goes 
too far in the opposite direction. Further, though it has been disposed 
of in many cases, there are, even in our present state of knowledge, many 
where it appears to be true; apples and oranges do not grow on the same 
tree. In genetics the suggested values are usually intermediate, such as 

J, and I; in such questions as bias of dice they may be ^ or What 
the suggested values will be in any specific case will depend on the cir- 
cumstances of the particular problem; we cannot give a universal rule 
for them beyond the common-sense one, that if anybody does not know 
what his suggested value is, or whether there is one, he does not know 
what question he is asking and consequently does not know what his 
answer means. But then the problem of sampling, as a pure estimation 
problem, is limited to the case where there is no suggested value and 
the prior probability has no singularities. Then there is no objection 
to the uniform distribution, and no other satisfying this condition has 




ever been seriously suggested, though there is something to be said for
the rule

$$P(dx \mid H) = \frac{1}{\pi}\,\frac{dx}{\sqrt{x(1-x)}}.$$


With this limitation, then, we may as well use the uniform distribution.
Even at the present state of knowledge, sampling ratios do seem to be
very uniformly distributed except for problems of certain specific types,
where suggested values exist. It is not asserted that such a rule will
hold for all time, nor can it if the work is done correctly. But we can
test what the form suggested would lead to, and say that in the present 
state of knowledge that is good enough to be going on with. 


3.2. Sampling. At first I shall extend the Bayes-Laplace theory to
the sampling of a finite population. The total number of the population
is N, which will be the sum r+s of the theory of random sampling.
But our problem is now to infer something about r, given N and the
sampling numbers l and m. Hence we must treat N as given and replace
s by N−r. Then the probability of the observed numbers, given N
and r, will be

$$P(l, m \mid N, r, H) = \binom{r}{l}\binom{N-r}{m} \Big/ \binom{N}{l+m}. \tag{1}$$

We have no information initially to say that one value of r, given N,
is more likely than another. Hence we must take all their prior
probabilities equal, and

$$P(r \mid NH) = 1/(N+1). \tag{2}$$

Then by the principle of inverse probability

$$P(r \mid l, m, N, H) \propto \binom{r}{l}\binom{N-r}{m}, \tag{3}$$

factors independent of r having been dropped. But some value of r
in the range 0 to N inclusive must be the right one, whence

$$\sum_{r=0}^{N} P(r \mid l, m, N, H) = 1 \tag{4}$$

and

$$P(r \mid l, m, N, H) = \binom{r}{l}\binom{N-r}{m} \Big/ \sum_{r=0}^{N}\binom{r}{l}\binom{N-r}{m}. \tag{5}$$

The summation is done by algebraic methods in Scientific Inference.
A simple alternative way of doing it, suggested to me by Dr. F. J. W.
Whipple, is as follows. Suppose that we have a class of N+1 things
arranged in a definite order, and that we wish to select l+m+1. This
can be done in $\binom{N+1}{l+m+1}$ ways. But we may proceed as follows. First
select an arbitrary member of the class; let it be the (r+1)th in order.
From the remainder we may select l from those before the (r+1)th
and m from those after it in $\binom{r}{l}\binom{N-r}{m}$ ways. But we might choose any
value of r, and all selections for different values of r are different,




since the (r+1)th of the class must be the (l+1)th of the sample.
Hence

$$\sum_{r=0}^{N}\binom{r}{l}\binom{N-r}{m} = \binom{N+1}{l+m+1}. \tag{6}$$
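
A small exact check of the summation, and of the resulting posterior distribution of r, can be sketched as follows. The function name `posterior_r` and the numerical values of N, l and m are illustrative assumptions; the weights are those of the uniform prior and the sampling probability described above.

```python
from fractions import Fraction
from math import comb

def posterior_r(N, l, m):
    """Posterior P(r | l, m, N, H) for r = 0..N under the uniform prior on r."""
    weights = [comb(r, l) * comb(N - r, m) for r in range(N + 1)]
    total = sum(weights)
    return [Fraction(w, total) for w in weights], total

N, l, m = 20, 3, 2
post, total = posterior_r(N, l, m)

# The normalising sum agrees with the number of ways of choosing
# l+m+1 things out of N+1.
print(total == comb(N + 1, l + m + 1))

# Values of r below l or above N-m are excluded by the sample.
print([r for r, p in enumerate(post) if p == 0])
```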

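If the sample is large and N is large, the application of Stirling’s
formula leads to the approximation

$$P(r \mid l, m, N, H) \approx \left\{\frac{n}{2\pi N(N-n)\,p(1-p)}\right\}^{1/2} \exp\left\{-\frac{nN\theta^2}{2(N-n)\,p(1-p)}\right\}, \tag{7}$$

where

$$n = l+m; \quad p = l/n; \quad \theta = \frac{r}{N} - \frac{l}{n}. \tag{8}$$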

Thus θ measures the departure from proportionality. Its probability
is distributed about 0 with standard error {(N−n)p(1−p)/nN}^{1/2}, which
approaches {p(1−p)/n}^{1/2} if the population is large compared with the
sample. This might be expected from the corresponding result in the
direct problem. Further, if N/n is large the probability of l/n given
r/N is nearly independent of N. The sample can therefore give us no
information about the size of the population, and the latter is irrelevant
to r/N given the sample, when N is large. But if N is such that we must
take into account the difference between N−n and N, the standard
error of θ is a little smaller than for a larger population; the solution
for the latter would also be applicable to problems of sampling with
replacement or of estimating chances. This represents really only the
fact that we regard the sample as part of the population, and our
definite knowledge of it reduces the standard error of the ratio for a finite
population of which the sample is a part.
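
The stated standard error can be compared with the exact posterior distribution in a short sketch. The function names and the particular values N = 2000, l = 60, m = 140 are assumptions chosen for illustration; the exact posterior is computed by direct enumeration under the uniform prior on r.

```python
from math import comb, sqrt

def posterior_sd_theta(N, l, m):
    """Exact posterior standard deviation of r/N (and hence of theta)."""
    weights = [comb(r, l) * comb(N - r, m) for r in range(N + 1)]
    total = sum(weights)
    probs = [w / total for w in weights]      # big integers divided into floats
    mean_r = sum(r * p for r, p in enumerate(probs))
    var_r = sum((r - mean_r) ** 2 * p for r, p in enumerate(probs))
    return sqrt(var_r) / N

def approx_sd_theta(N, l, m):
    """Large-sample standard error {(N-n) p (1-p) / (n N)}**(1/2)."""
    n = l + m
    p = l / n
    return sqrt((N - n) * p * (1 - p) / (n * N))

N, l, m = 2000, 60, 140
print(posterior_sd_theta(N, l, m), approx_sd_theta(N, l, m))
```

With these numbers the two values differ by well under two per cent, and the agreement should improve as the sample grows.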

This may be seen by considering the probability that the next
specimen will be of the first type. The population being of number N,
of which n have already been removed, and the members of the first
type being r in number, of which l have been removed, the probability
of the proposition p, that the next would be of the type, given r, N and
the sample, is

$$P(p \mid l, m, N, r, H) = \frac{r-l}{N-n}. \tag{9}$$

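Combining with (5) by the product rule,

$$P(r, p \mid l, m, N, H) = \frac{r-l}{N-n}\,\binom{r}{l}\binom{N-r}{m} \Big/ \binom{N+1}{n+1}. \tag{10}$$

The total probability of p on the data is got by summing over all values
of r. But

$$\frac{r-l}{N-n}\,\frac{r!}{l!\,(r-l)!} = \frac{(l+1)\,r!}{(N-n)\,(l+1)!\,(r-l-1)!} = \frac{l+1}{N-n}\binom{r}{l+1}, \tag{11}$$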





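and

$$\sum_{r=0}^{N}\binom{r}{l+1}\binom{N-r}{m} = \binom{N+1}{n+2}. \tag{12}$$

Hence

$$P(p \mid l, m, N, H) = \frac{l+1}{N-n}\,\binom{N+1}{n+2} \Big/ \binom{N+1}{n+1} = \frac{l+1}{n+2} = \frac{l+1}{l+m+2}, \tag{13}$$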


which is independent of N. It is usually known as Laplace’s rule of
succession.† Neither Bayes nor Laplace, however, seems to have
considered the case of finite N. They both proceed by considering a chance
x, which would correspond to r/N, taking the prior probability of x
uniform between 0 and 1, and using the binomial law for the likelihood.
The formal result is naturally the same; but I think that the first person
to see that the result is independent of N was Professor C. D. Broad.‡
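
The independence of N can be confirmed by exact arithmetic in a brief sketch. The function name `prob_next_first_type` and the sample numbers are illustrative; the calculation simply sums (r−l)/(N−n) over the posterior distribution of r obtained from the uniform prior.

```python
from fractions import Fraction
from math import comb

def prob_next_first_type(N, l, m):
    """Exact P(next specimen is of the first type | l, m, N), requiring N > l + m."""
    n = l + m
    weights = [comb(r, l) * comb(N - r, m) for r in range(N + 1)]
    total = sum(weights)
    # Terms with r < l have zero weight, so their negative factor r - l is harmless.
    return sum(Fraction(r - l, N - n) * Fraction(w, total)
               for r, w in enumerate(weights))

l, m = 3, 2
for N in (6, 10, 50, 200):
    print(N, prob_next_first_type(N, l, m), Fraction(l + 1, l + m + 2))
```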
Having got so far, we can see at once that the probability, given the
sample, that the next n′ will consist of l′ of the first type and n′−l′
of the second is also independent of N. For we can construct in turn
the probabilities of the second further member being of the type, given
the sample and the (n+1)th, of the third given the sample and the
(n+1)th and (n+2)th, and so on indefinitely. All of these are
independent of N, and the probability of a series of l′ and m′ in any prescribed
order will be built up by multiplying the results. This is found to be

$$\frac{(l+1)(l+2)\cdots(l+l')\,(m+1)(m+2)\cdots(m+m')}{(n+2)(n+3)\cdots(n+n'+1)}, \tag{14}$$

irrespective of the order; and the number of possible orders is $\binom{n'}{l'}$.
Hence the probability given the sample that the next n′ will
contain just l′ of the first type, in any order, is

$$\binom{n'}{l'}\,\frac{(l+1)\cdots(l+l')\,(m+1)\cdots(m+m')}{(n+2)\cdots(n+n'+1)}. \tag{15}$$

This leads to some further interesting results. Suppose that m = 0,
so that the sample is all of one type. Then the probability given the
sample that the next will be of the type is (l+1)/(l+2), which will be
large if the sample is large. The probability that the next l′ will all be
of the type (m′ = 0) is (l+1)/(l+l′+1). Thus given that all members
yet examined are of the type, there is a probability 1/2 that the next
l+1 will also be of the type; a result given by Pearson by an extension
of Laplace’s analysis. But if l′ = N−l, the result is (l+1)/(N+1).
This can be obtained otherwise. For l′ = N−l is the proposition that
the entire population is of the same type, and is equivalent to r = N.


† Mém. de l’Acad. R. d. Sci., Paris, 6, 1774, 621; Œuvres Complètes, 8, 30. Curiously, it
is not reproduced in the Théorie Analytique. ‡ Mind, 27, 1918, 389-404.





But

$$P(r = N \mid l, m, N, H) = \frac{l+1}{N+1}. \tag{16}$$

It follows that with the uniform distribution of the prior probability
(1) a large homogeneous sample will establish a high probability that
the next member will be of the same type, and a moderate probability
that a further sample comparable in size with the first sample will be
of the type, (2) sampling will never give a high probability that the
whole population is homogeneous unless the sample constitutes a large
fraction of the whole population.
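
The results for a further sample can be reproduced exactly with a short sketch. It assumes the product form obtained by multiplying successive applications of the rule of succession, written with factorials; the function name `prob_next_sample` and the numbers used are illustrative.

```python
from fractions import Fraction
from math import comb, factorial

def prob_next_sample(l, m, l2, m2):
    """P(the next l2+m2 members contain exactly l2 of the first type | sample l, m)."""
    n, n2 = l + m, l2 + m2
    # (l+1)...(l+l2) (m+1)...(m+m2) / ((n+2)...(n+n2+1)) written with factorials
    in_order = Fraction(factorial(l + l2) * factorial(m + m2) * factorial(n + 1),
                        factorial(l) * factorial(m) * factorial(n + n2 + 1))
    return comb(n2, l2) * in_order

# The probabilities over all possible compositions of the next six add up to 1.
print(sum(prob_next_sample(4, 1, k, 6 - k) for k in range(7)))

# Homogeneous sample of 10: the next 11 are all of the type with probability 1/2.
print(prob_next_sample(10, 0, 11, 0))

# Homogeneous sample of 10 from a population of 1000: whole population of the type.
print(prob_next_sample(10, 0, 990, 0), Fraction(11, 1001))
```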

3.21. The last result was given by Broad in the paper just mentioned,
and was the first clear recognition, I think, of the need to modify the
uniform assessment if it was to correspond to actual processes of
induction. It was the profound analysis in this paper that led to the work
of Wrinch and myself.† We showed that Broad had, if anything,
understated his case, and indicated the kind of changes that were needed to
meet its requirements. The rule of succession had been generally
appealed to as a justification of induction; what Broad showed was that
it was no justification whatever for attaching even a moderate
probability to a general rule if the possible instances of the rule are many
times more numerous than those already investigated. If we are ever
to attach a high probability to a general rule, on any practicable amount
of evidence, it is necessary that it must have a moderate probability
to start with. Thus I may have seen 1 in 1,000 of the ‘animals with
feathers’ in England; on Laplace’s theory the probability of the
proposition, ‘all animals with feathers have beaks’, would be about 1/1000.
This does not correspond to my state of belief or anybody else’s. We
might try to avoid the difficulty by introducing testimony, through the
principle that if there were animals with feathers and without beaks,
somebody would have seen them and I should have heard of it. This
is perhaps questionable, but it only shifts the difficulty, because it
raises the need to consider the proposition, ‘all other people mean the
same thing by words as I do’, and this would itself be an inductive
generalization as hard to accept, on Laplace’s theory, as the first. The
fundamental trouble is that the prior probabilities 1/(N+1) attached
by the theory to the extreme values are so utterly small that they
amount to saying, without any evidence at all, that it is practically
certain that the population is not homogeneous in respect of the
property to be investigated; so nearly certain that no conceivable

† Phil. Mag. 42, 1921, 369-90; 45, 1923, 368-74.




amount of observational evidence could appreciably alter this position. 
The situation is even worse in relation to quantitative laws, as Wrinch 
and I showed; the extension to continuous magnitudes would make the 
probability that a new parameter suggested is zero always genuinely 
infinitesimal, and there would be no way out of the difficulty considered 
on p. 103. Now I say that for that reason the uniform assessment must 
be abandoned for ranges including the extreme values, by rule 6 and 
by the considerations already quoted from Pearson. An adequate 
theory of scientific investigation must leave it open for any hypothesis 
whatever that can be clearly stated to be accepted on a moderate amount 
of evidence. It must not rule out a clearly stated hypothesis, such as 
that a class is homogeneous, until there is definite evidence against it. 
Similarly, it must not rule out a quantitative law stated in terms of 
a finite number of parameters. But this amounts to enunciating the 
principle : Any clearly stated law has a finite prior probability, and therefore 
an appreciable posterior probability until there is definite evidence against 
it. This is the fundamental statement of the simplicity postulate. The
remarkable thing, indeed, is that this was not seen by Laplace, who in 
other contexts is referred to as the chief advocate of extreme causality. 
Had he applied his analysis of sampling to the estimation of the com- 
position of an entire finite population, it seems beyond question that 
he would have seen that it could never lead to an appreciable probability 
for a single general law, and is therefore unsatisfactory. 

The admission of a probability for the extreme values that remains
finite however large the population may be, leads at once to satisfactory
results. For if we take

$$P(r = 0 \mid NH) = P(r = N \mid NH) = k \tag{17}$$

and distribute the remainder 1−2k uniformly over the other values,
we shall have

$$P(r \mid NH) = \frac{1-2k}{N-1} \quad (r \neq 0, N). \tag{18}$$

For k = 1/(N+1) this reduces to Laplace’s rule. Now if the sample is
not homogeneous the extreme possible values of r give zero probability
to the sample, and are therefore excluded by the data; while for
comparison of intermediate values the new prior probability merely gives
an irrelevant constant factor and leaves the result as it was before. Thus
the results derived from a mixed sample will need no change.

But now suppose that the sample is all of the first type, so that
l = n. r = 0 is now excluded by the data, but we want the revised



posterior probability that r = N. This can be derived easily. For the
likelihood factors are unaltered, and for r ≠ N the ratios of the prior
probabilities are unaltered. Therefore we need only consider the two
alternatives r = N and r ≠ 0, N, multiplying the previous posterior
probabilities in the same ratio as the prior probabilities. The former
were (n+1)/(N+1) and (N−n)/(N+1); the previous prior probabilities
were 1/(N+1) and (N−1)/(N+1), the new prior probabilities k and
1−2k. Hence, now,

$$\frac{P(r = N \mid l = n, N, H)}{P(r \neq N \mid l = n, N, H)} = \frac{n+1}{N-n}\,\frac{k}{1-2k}\,\frac{N-1}{1}. \tag{19}$$

Hence if n is large, the ratio is greater than (n+1)k/(1−2k) whatever N
may be, and the posterior probability that r = N will approach 1 as
the sample increases, almost irrespective of N, as soon as n has reached
1/k. We may notice that if n = 1, the ratio is 2k/(1−2k), which is
independent of N if k is.
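
The posterior ratio can be verified by direct enumeration in a minimal sketch. It assumes a prior that puts k on each of r = 0 and r = N and spreads 1−2k evenly over the other values, with a homogeneous sample of n members; the function name and numerical values are illustrative.

```python
from fractions import Fraction
from math import comb

def posterior_odds_homogeneous(N, n, k):
    """Odds on r = N after a sample of n, all of the first type (1 <= n < N)."""
    prior = [(1 - 2 * k) / (N - 1)] * (N + 1)
    prior[0] = prior[N] = k
    # Likelihood of an all-first-type sample is proportional to C(r, n).
    weights = [prior[r] * comb(r, n) for r in range(N + 1)]
    post_N = weights[N] / sum(weights)
    return post_N / (1 - post_N)

N, n, k = 50, 5, Fraction(1, 4)
closed_form = Fraction(n + 1, N - n) * k / (1 - 2 * k) * (N - 1)
print(posterior_odds_homogeneous(N, n, k), closed_form)

# With k = 1/(N+1) the uniform assessment is recovered: odds (n+1)/(N-n).
print(posterior_odds_homogeneous(N, n, Fraction(1, N + 1)), Fraction(n + 1, N - n))
```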

The best value to take for k is not clear, but the following
considerations are relevant. If k = 1/2, it says that we already know that
r = 0 or N; hence this is too large. If k = 1/(N+1), we recover the result
from the uniform assessment, and this is too low. k = 1/4 gives the ratio

$$\frac{n+1}{2}\,\frac{N-1}{N-n},$$

which = 1 if n = 1; this would say that a generalization
on one instance has probability 1/2, which is not unreasonable. The
trouble here is that on the uniform assessment, if N = 2, k is already
1/3, so that k = 1/4 is too low in this case. If we are to make a general rule
independent of N we are therefore restricted to values of k between 1/3
and 1/2. A possible alternative form would be to take

$$k = \frac{1}{4} + \frac{1}{2(N+1)}, \tag{20}$$

which puts half the prior probability into the extremes and leaves the
other half distributed equally over all values, including the extremes.
The basis of such an assessment would be a classification of the
possibilities as follows: (1) Population homogeneous on account of some
general rule. (2) No general rule, but extreme values to be treated on
a level with others. Alternative (1) would then be distributed equally
between the two possible cases, and (2) between its N+1 possible cases.
This is in accordance with the principles of significance tests, which
will be developed later. For N = 2 it gives k = 5/12, leaving 1/6 for the
prior probability that the two members are unlike. For N large it



gives the ratio of the posterior probabilities

$$\frac{n+1}{2}\,\frac{N+3}{N-n},$$

which seems satisfactory. It is possible, therefore, to give assessments
of the prior probability that avoid the difficulty found by Broad. The
solution would be suited to a case where it is still a serious possibility
that the class is all of one type, but we do not know of which type.
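
The arithmetic of this assessment is easily checked. The sketch below assumes the form k = 1/4 + 1/(2(N+1)), that is, half the prior probability shared between the two extremes and the other half spread over all N+1 values, and substitutes it into the posterior ratio for a homogeneous sample; the function names are illustrative.

```python
from fractions import Fraction

def k_half_in_extremes(N):
    """Prior probability of each extreme: 1/4 plus an equal share of the other half."""
    return Fraction(1, 4) + Fraction(1, 2 * (N + 1))

def odds_population_homogeneous(N, n):
    """Posterior odds on r = N after a homogeneous sample of n, using this k."""
    k = k_half_in_extremes(N)
    return Fraction(n + 1, N - n) * k / (1 - 2 * k) * (N - 1)

# N = 2: k and the prior probability that the two members are unlike.
print(k_half_in_extremes(2), 1 - 2 * k_half_in_extremes(2))

# The odds reduce to (n+1)(N+3) / (2(N-n)).
N, n = 1000, 10
print(odds_population_homogeneous(N, n), Fraction((n + 1) * (N + 3), 2 * (N - n)))
```

On these illustrative numbers a homogeneous sample of only 10 out of 1,000 already gives odds of more than 5 to 1 on the whole population being of the type.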

A partial solution has been given by Pearson.† ‘Suppose the
solidification of hydrogen to have been once accomplished. . . . What is the
probability that on repetition of the same process the solidification of
hydrogen will follow? Now Laplace has asserted that the probability
that an event which has occurred p times and has not hitherto failed
will occur again, is represented by the fraction (p+1)/(p+2). Hence, in the
case of hydrogen, the probability of repetition would be only 2/3, or, as
we popularly say, the odds would be two to one in its favour. On the
other hand, if the sun has risen without fail a million times, the odds
in favour of its rising to-morrow would be 1,000,001 to 1. It is clear
that on this hypothesis there would be practical certainty with regard
to the rising of the sun being repeated, but only some likelihood with
regard to the solidification of hydrogen being repeated. The numbers,
in fact, do not in the least represent the degrees of belief of the scientist
regarding the repetition of the two phenomena. We ought rather to
put the problem in this manner; p different sequences of perceptions
have been found to follow the same routine, however often repeated,
and none have been known to fail, what is the probability that the
(p+1)th sequence of perceptions will have a routine? Laplace’s theorem
shows us that the odds are p+1 to 1 in favour of the new sequence
having a routine. In other words, since p represents here the infinite
variety of phenomena in which men’s past experience has shown that
the same causes are on repetition followed by the same effect, there are
overwhelming odds that any newly observed phenomenon may be
classified under this law of causation. So great and, considering the
odds, reasonably great is our belief in this law of causation applying to
new phenomena, that when a sequence of perceptions does not appear
to repeat itself, we assert with the utmost confidence that the same
causes have not been present in the original and in the repeated
sequence.’ Here Pearson goes far to anticipate the difficulty raised by
Broad, in fact too far, for he almost says that exact causality has been
established in general by inductive methods. But he has given one

† The Grammar of Science, 1911, p. 141. Everyman edition, p. 122.





essential point, by transferring the Laplacean inference from simple 
events to laws. If routines have been established in half the cases 
already examined, that is adequate ground for attaching a prior
probability 1/2 that there will be a routine in a new case. If it has been found
that all pure substances yet examined have fixed freezing-points, the
p+1 to 1 rule would apply as it stands, p being now the number so
far tested. The weakness of the argument is that each of the previous 
cases of routine has involved an induction from a finite number of 
observations to a general law, and if we started with the Laplace assess- 
ment we should never be able by induction to attach a high probability 
to even one general law. Pearson’s argument, with the above modifica- 
tion, is highly important in relation to present procedure, but the type 
of assessment (20) is needed at the outset in any case. 

3.22. In what follows Dirichlet integrals are used several times. As
they are usually expressed in the Γ notation, and I find the factorial
notation more convenient (it is also adopted in the British Association
Tables), the main formulae are given at this point.

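For w variables all between 0 and 1,

$$\int\!\!\int\!\cdots\!\int f\Big(\sum x\Big)\,x_1^{l_1}x_2^{l_2}\cdots x_w^{l_w}\,dx_1\ldots dx_w \quad \Big(0 \le \sum x \le 1\Big) = \frac{l_1!\,l_2!\cdots l_w!}{(\sum l + w - 1)!}\int_0^1 f(t)\,t^{\sum l + w - 1}\,dt, \tag{1}$$

$$\int\!\!\int\!\cdots\!\int x_1^{l_1}x_2^{l_2}\cdots x_w^{l_w}\,dx_1\ldots dx_w \quad \Big(0 \le \sum x \le 1\Big) = \frac{l_1!\,l_2!\cdots l_w!}{(\sum l + w)!}, \tag{2}$$

$$\int\!\!\int\!\cdots\!\int x_1^{l_1}x_2^{l_2}\cdots x_w^{l_w}\,dx_1\ldots dx_w \quad \Big(0 \le \sum x^p \le 1\Big) = \frac{\left(\frac{l_1+1}{p}-1\right)!\cdots\left(\frac{l_w+1}{p}-1\right)!}{p^w\left(\frac{\sum l + w}{p}\right)!}. \tag{3}$$

For $l_1 = l_2 = \ldots = l_w = 0$, (2) reduces to 1/w!.
For $l_1 = \ldots = l_w = 0$, p = 2, (3) becomes

$$\frac{\pi^{w/2}}{2^w\,(\tfrac12 w)!}.$$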




If negative values of the x’s are admitted, this is multiplied by $2^w$.
This gives what is often called the volume of a w-dimensional sphere
of radius 1. That of a w-dimensional sphere of radius c is therefore

$$\frac{\pi^{w/2}}{(\tfrac12 w)!}\,c^w, \tag{6}$$

which reduces to πc² for w = 2, and (4/3)πc³ for w = 3, as it should.
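
These special cases can be checked numerically with a short sketch. It assumes the usual correspondence between the factorial and Γ notations, z! = Γ(z+1), and the standard Dirichlet result that the integral of a product of integer powers over the region where the variables sum to at most 1 is the product of the factorials divided by (Σl + w)!; the function names are illustrative.

```python
import math
from fractions import Fraction

def dirichlet_simplex_integral(exponents):
    """Integral of prod x_i**l_i over x_i >= 0, sum x_i <= 1, for integer l_i >= 0."""
    w = len(exponents)
    numerator = math.prod(math.factorial(l) for l in exponents)
    return Fraction(numerator, math.factorial(sum(exponents) + w))

def sphere_volume(w, c=1.0):
    """Volume of a w-dimensional sphere of radius c: pi**(w/2) / (w/2)! * c**w."""
    return math.pi ** (w / 2) / math.gamma(w / 2 + 1) * c ** w

print(dirichlet_simplex_integral([0, 0, 0]))        # volume of the region for w = 3
print(sphere_volume(2, 3.0), math.pi * 3.0 ** 2)    # circle of radius 3
print(sphere_volume(3, 2.0), 4 / 3 * math.pi * 2.0 ** 3)
```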

3.23. Multiple sampling. When the class sampled consists of
several types we can generalize Laplace’s assessment, with similar
provisos to those needed in the simpler case. Suppose that the whole
number of members is n, divided among r types, the numbers of the
respective types being $m_1, m_2, \ldots, m_r$. Then we say that all compositions
are equally probable. The number of ways of dividing n things into
r classes is (n+r−1)!/n!(r−1)!; but $m_r$ is determined when the rest are
known, and can therefore be omitted by Axiom 6. Hence

$$P(m_1, m_2, \ldots, m_{r-1} \mid nH) = n!\,(r-1)!/(n+r-1)!. \tag{1}$$

Of these possibilities, if $m_1$ is considered fixed, the number of partitions
among the others is the number of ways of dividing $n-m_1$ things into
r−1 classes, which is $(n-m_1+r-2)!/(n-m_1)!\,(r-2)!$. Hence for $m_1$ by
itself

$$P(m_1 \mid nH) = \frac{(r-1)\,n!}{(n+r-1)!}\,\frac{(n-m_1+r-2)!}{(n-m_1)!}. \tag{2}$$

If n is very large, put $m_1 = np_1$, and so on. The proposition that $m_1$
has a particular value becomes the proposition that $p_1$ is in a particular
range $dp_1$ of length 1/n. Then

$$P(dp_1\,dp_2 \ldots dp_{r-1} \mid nH) = \frac{n!\,(r-1)!}{(n+r-1)!}\,n^{r-1}\,dp_1 \ldots dp_{r-1} \to (r-1)!\,dp_1 \ldots dp_{r-1}. \tag{3}$$

Here n has disappeared and need not be considered further. This gives
the distribution of the joint prior probability that the chances of the
various types lie in particular ranges. For $p_1$ separately we can
approximate to (2), $n-m_1$ being large compared with r, or integrate (3). Then

$$P(dp_1 \mid H) = (r-1)(1-p_1)^{r-2}\,dp_1. \tag{4}$$

In (4) the probability of $p_1$ is no longer uniformly distributed as on
Laplace’s assessment. This expresses the fact that the average value
of all the p’s is now 1/r instead of 1/2 as for the case of two alternatives;
it would now be impossible for more than two of them to exceed 1/3.
But if all but two of them are fixed the prior probability is uniformly
distributed between these two.
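
A brief exact check of the prior for a single type can be sketched as follows. It assumes only the counting argument above (compositions with $m_1$ fixed divided by all compositions); the function name `prior_m1` and the values of n and r are illustrative. The probabilities add up to 1, the mean of $m_1$ is n/r, and for large n the value of n P($m_1$) comes close to $(r-1)(1-p_1)^{r-2}$.

```python
from fractions import Fraction
from math import comb

def prior_m1(m1, n, r):
    """P(m1 | n, H) when all compositions of n into r types are equally probable."""
    return Fraction(comb(n - m1 + r - 2, r - 2), comb(n + r - 1, r - 1))

n, r = 30, 4
probs = [prior_m1(m1, n, r) for m1 in range(n + 1)]
print(sum(probs))                                        # total probability
print(sum(m1 * p for m1, p in enumerate(probs)), Fraction(n, r))

# Large-n comparison with the continuous density (r-1)(1-p1)**(r-2) at p1 = 1/4.
n_big, m1 = 10000, 2500
print(float(n_big * prior_m1(m1, n_big, r)), (r - 1) * (1 - m1 / n_big) ** (r - 2))
```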

Suppose that we have made a sample and that the numbers of various 





types are $x_1, x_2, \ldots, x_r$. The probability of the sample, given the p’s and
the actual order of occurrence, is $p_1^{x_1} \cdots p_r^{x_r}$, whence, by (3),

$$P(dp_1 \ldots dp_{r-1} \mid \theta H) \propto p_1^{x_1} \cdots p_r^{x_r}\,dp_1 \ldots dp_{r-1}, \tag{5}$$

factors independent of the p’s having been dropped. Integrating with
respect to all p’s except $p_1$, the sum of the others being restricted to
be less than $(1-p_1)$, we have (θ denoting the observed data)

$$P(dp_1 \mid \theta H) \propto \frac{x_2! \cdots x_r!}{(x_2+\cdots+x_r+r-2)!}\,p_1^{x_1}(1-p_1)^{x_2+\cdots+x_r+r-2}\,dp_1 \propto p_1^{x_1}(1-p_1)^{x_2+\cdots+x_r+r-2}\,dp_1. \tag{6}$$

But if we are given only $p_1$, the probability of getting $x_1$ of the first type
and $\sum x - x_1$ of the others together is $p_1^{x_1}(1-p_1)^{\sum x - x_1}$; and combining this
with (4) we get (6) again, the factor r−1 being independent of $p_1$. Hence,
if we only want the fraction of the class that is of one particular type, we
need consider only the number of that type and the total of the other
types in the sample. The distribution among the other types is irrelevant.

By a similar analysis to that used for simple sampling it is found that
the probability, given the sample, that the next member chosen will be
of the first type is

$$\int_0^1 p_1^{x_1+1}(1-p_1)^{\sum x - x_1 + r - 2}\,dp_1 \Big/ \int_0^1 p_1^{x_1}(1-p_1)^{\sum x - x_1 + r - 2}\,dp_1 = \frac{x_1+1}{\sum x + r}. \tag{7}$$
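
The ratio of integrals can be evaluated exactly for integer counts in a short sketch. It assumes the elementary result that the integral of $p^a(1-p)^b$ from 0 to 1 is a! b!/(a+b+1)! for non-negative integers a and b; the function names and the sample counts are illustrative.

```python
from fractions import Fraction
from math import factorial

def beta_integral(a, b):
    """Integral of p**a * (1-p)**b over (0, 1) for non-negative integers a, b."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))

def prob_next_of_type(counts, i):
    """P(next member is of type i | sample counts), for r = len(counts) >= 2 types."""
    r, total = len(counts), sum(counts)
    b = total - counts[i] + r - 2            # exponent of (1 - p_i) in the posterior
    return beta_integral(counts[i] + 1, b) / beta_integral(counts[i], b)

counts = [5, 3, 2]                            # e.g. three colours in a sample of 10
print([prob_next_of_type(counts, i) for i in range(len(counts))])
print(sum(prob_next_of_type(counts, i) for i in range(len(counts))))
```

With two types the same function returns the simple rule of succession, since r = 2 makes the extra exponent vanish.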


W. E. Johnson,† assuming that distribution among the other types is
irrelevant to the probability of $p_1$, and working entirely with the
posterior probability, has shown by an ingenious method that the
probability that the next specimen will be of the first type is linear in
$x_1$. His formula, in the present notation, is $(wx_1+1)/(w\sum x+r)$. w is not
evaluated; (7) shows that in the conditions considered here w = 1.

The conditions in question in fact assume that information about the
proportion of the class that is of one type is irrelevant to the ratios of
the numbers of the other types. They would apply to an estimation of
the proportions of blue, white, and pink flowers in Polygala vulgaris.
We may call this a simple statement of alternatives. If the class falls
into main types, according to one set of properties, each of which is
subdivided according to another set, and the ratios within one main
type give no information about those in another, the result needs some
change, as we shall see for a 2 × 2 classification in § 6.11. The numbers of
the main types can then be estimated according to Laplace’s rule and


† Mind, 41, 1932, 421-3.





the distribution within each according to that just given. The difference 
arises from the fact that the discovery that several subtypes of the same 
main type are rare will give some inductive ground for supposing that 
other subtypes of that type are also rare: there is no longer complete 
independence apart from the bare fact that the sum of all the chances 
must be 1. 


3.3. The Poisson distribution. The derivation of this law suggests 
an analogy with sampling, but there is a difference since the one para- 
meter involved is capable of any positive value. It is the product of a 
chance known to be small in any one trial, and the number of trials, 
which is large. We might try to regard the problem of radioactivity, 
for instance, as one of sampling, the problem being to estimate the 
fraction of the atoms in the specimen that break up in the time of the 
experiment. But this is not valid because the size of the specimen and 
the time of the experiment are themselves chosen so as to make the 
expectation large; we already know that the fraction that break up is 
small but not zero. This must be expressed by a concentration of the 
prior probability towards small values. It is not covered by either the 
uniform assessment or the suggestion of a finite concentration at 0. 
The fundamental object of the work is to estimate the parameter $\alpha$ in
the formula $e^{-\alpha t}$, which represents the fraction of the atoms originally
present that survive after time $t$. This parameter is not a chance but
a chance per unit time, and therefore is dimensional; thus the correct
prior probability distribution for it, given that it must lie between 0
and $\infty$ and is otherwise unknown, is $d\alpha/\alpha$. In the dust counter, similarly,
the fundamental parameter is the number of particles per unit volume,
which again is dimensional; but it might appear equally legitimate to
use the mean volume per particle, and the $dr/r$ rule holds, though possibly
with a slight modification to take account of the fact that the air
cannot be all dust. In the problem of the soldiers killed by horses a
time factor again enters. It appears best, therefore, in problems where
the Poisson law arises, to take the prior probability

$$P(dr \mid H) \propto dr/r. \tag{1}$$

Also given $r$, the chance that the event will happen $m$ times in any
interval is
$$P(m \mid rH) = \frac{r^m e^{-r}}{m!}. \tag{2}$$

The joint chance for several intervals is therefore
$$P(m_1, m_2, \dots, m_n \mid rH) = \frac{r^{\sum m}\, e^{-nr}}{m_1!\, m_2! \dots m_n!} \tag{3}$$



and, omitting factors independent of $r$, we have
$$P(dr \mid m_1, \dots, m_n, H) \propto r^{\sum m - 1} e^{-nr}\, dr. \tag{4}$$


The probability, given the observations, that $r$ is in any particular
range is given by the incomplete $\Gamma$ function.† We notice that the only
function of the observations that appears in the posterior probability
is $\sum m$, which is therefore a sufficient statistic for $r$. The utility of further
information about the individual $m$'s is that they may provide a check
on whether the Poisson law actually holds, or whether, for instance,
there is a deviation in the direction of the negative binomial. The
expectation of $r$, given the data, is at $\bar m = (\sum m)/n$; the maximum
probability density is at a slightly smaller value, and the standard error
is $(\bar m/n)^{1/2}$ if $\sum m$ is large.
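As an illustration of the result just stated, the following minimal sketch summarises the posterior of $r$ for a set of counts. It assumes the $dr/r$ prior, so that the posterior density is proportional to $r^{\sum m - 1}e^{-nr}$ (a gamma distribution). The function name and the counts are hypothetical, chosen only for the example.

```python
from math import sqrt

def poisson_posterior_summary(counts):
    """Summarise the posterior of the Poisson rate r under the dr/r prior.

    The posterior density is proportional to r**(sum_m - 1) * exp(-n * r),
    i.e. a gamma distribution with shape sum_m and rate n.
    """
    n = len(counts)
    total = sum(counts)
    if n == 0 or total <= 0:
        raise ValueError("need at least one interval and one observed event")
    mean = total / n            # expectation of r: (sum m) / n
    mode = (total - 1) / n      # maximum density, slightly below the mean
    sd = sqrt(total) / n        # equals sqrt(mean / n)
    return {"n": n, "sum_m": total, "mean": mean, "mode": mode, "sd": sd}

# Hypothetical counts in five equal intervals.
summary = poisson_posterior_summary([3, 5, 4, 2, 6])
```

Only the total count and the number of intervals enter the summary, which is the sense in which $\sum m$ is sufficient; the individual counts would be needed only to check the Poisson law itself.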


3.4. The normal law of error. We consider first the case where the
standard error is known, but the true value $x$ is unknown over a wide
range. Then $\sigma$ is part of the data $H$, and
$$P(dx \mid H) \propto dx. \tag{1}$$
Also the joint chance of all the observations is
$$P(dx_1 \dots dx_n \mid x, H) = \frac{1}{(2\pi)^{n/2}\sigma^n} \exp\left[-\frac{n}{2\sigma^2}\{(\bar x - x)^2 + s'^2\}\right] dx_1\, dx_2 \dots dx_n. \tag{2}$$
Hence, omitting factors independent of $x$,‡
$$P(dx \mid x_1, x_2, \dots, x_n, H) \propto \exp\left\{-\frac{n}{2\sigma^2}(x - \bar x)^2\right\} dx, \tag{3}$$


† J. B. S. Haldane, Proc. Camb. Phil. Soc. 28, 1932, 68. This paper contained the use
of the dv/v rule for the prior probability in such cases, at a time when I had considered it
only in relation to a standard error; also the concentration of a finite fraction of the prior
probability in a particular value, which later became the basis of my significance tests.

‡ It is understood that dx in the sign P(dx | ...) is an abbreviation for a proposition,
namely that a quantity f whose probability distribution is being considered lies in a
particular range x to x+dx. In the data x, H of (2), x is used as an abbreviation for the
same proposition; but it is convenient to abbreviate the same proposition in different
ways according as it appears in the data or in the proposition whose probability is being
considered. The reason is that in (1) or (3) P(dx | ...) is an element of a distribution, and
the differential calls attention to this fact and appears explicitly on the right; but in (2)
the variation of x in an arbitrarily small range contributes arbitrarily little to the right
side, and we need attend only to the value of x. This method of abbreviation lends itself
to immediate adaptation to integration:

$$\int_{x_1}^{x_2} P(dx \mid q) = P(x_1 < x < x_2 \mid q).$$



so that the posterior probability of $x$ is normally distributed about $\bar x$
with standard error $\sigma/\sqrt{n}$.

In practical cases there is usually some previous information relevant 
to $x$. Perhaps the discovery of a new star (nova) affords the simplest
example. The original discovery is a non-quantitative observation, 
often a naked-eye one, but by comparison with neighbouring stars it 
gives enough information to enable the observer to identify the new 
star again. It may be enough to specify the position within 1°, but 
later measurements may have a standard error of the order of 1". 
Then (1) should strictly be replaced by
$$P(dx \mid H) = f(x)\, dx,$$
where $f(x)$ is very small if $x$ is not within a particular range of order 1°,
and within this range $f(x)$ varies slowly. But then we get
$$P(dx \mid x_1 \dots x_n H) \propto f(x) \exp\left\{-\frac{n}{2\sigma^2}(x - \bar x)^2\right\} dx.$$
$\bar x$ is within the range where $f(x)$ is appreciable (otherwise the accurate
observer would be observing the wrong star) and the exponential factor
is negligible if $|x - \bar x|$ is more than about 3″. In this range we can
neglect the variation of $f(x)$, and on adjusting the constant factor we
are led again to (3) with a high degree of accuracy. In such cases
the original information is not contradicted by the new evidence, but 
is superseded in the sense that when the latter is available the effect 
of the original information on the result is negligible. Similar considera- 
tions can arise in most of the problems of this chapter and the next, 
and we shall not usually call special attention to them. 

3.41. If the standard error is unknown, its prior probability must
be proportional to $d\sigma/\sigma$, partly because it is usually dimensional and
might be either very large or very small, partly because we might
equally well take the precision constant as our standard of accuracy.
Also we need not suppose that any previous knowledge of $x$ would tell
us anything directly about $\sigma$, so that the prior probabilities of $x$ and $\sigma$
may be taken independent. Then
$$P(dx\, d\sigma \mid H) \propto dx\, d\sigma/\sigma. \tag{1}$$
The likelihood factor is the same as before; hence
$$P(dx\, d\sigma \mid x_1, x_2, \dots, x_n, H) \propto \sigma^{-n-1} \exp\left[-\frac{n}{2\sigma^2}\{(x - \bar x)^2 + s'^2\}\right] dx\, d\sigma \tag{2}$$
and the constant factor is
$$\frac{n^{n/2}\, s'^{\,n-1}}{2^{n/2-1}\sqrt{\pi}\,(\tfrac12 n - \tfrac32)!}.$$





We notice here the immediate representation of the posterior probability
in terms of the sufficient statistics $\bar x$ and $s'$. All the other factors
depending on the observations are the same for all values of $x$ and $\sigma$,
and therefore cancel from the posterior probability.

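To get the posterior probability of $x$ by itself we have only to integrate
with regard to $\sigma$. We have
$$P(dx \mid x_1, x_2, \dots, x_n, H) \propto dx \int_0^\infty \sigma^{-n-1} \exp\left[-\frac{n}{2\sigma^2}\{(x - \bar x)^2 + s'^2\}\right] d\sigma \tag{3}$$
which becomes, on putting
$$\frac{n}{2\sigma^2}\{s'^2 + (x - \bar x)^2\} = u, \tag{4}$$
$$P(dx \mid x_1, x_2, \dots, x_n, H) \propto \int_0^\infty u^{n/2-1} e^{-u}\, du \cdot \{s'^2 + (x - \bar x)^2\}^{-n/2}\, dx. \tag{5}$$
Only the last factor involves $x$. Determining the constant factor by the
condition that $-\infty < x < \infty$, we have
$$P(dx \mid x_1, x_2, \dots, x_n, H) = \frac{(\tfrac12 n - 1)!}{\sqrt{\pi}\,(\tfrac12 n - \tfrac32)!}\, \frac{s'^{\,n-1}}{\{s'^2 + (x - \bar x)^2\}^{n/2}}\, dx. \tag{6}$$
The right side is identical with 'Student's' rule in form.
Integrating (2) with respect to $x$ we get
$$P(d\sigma \mid x_1 \dots x_n, H) \propto \sigma^{-n} \exp\left(-\frac{n s'^2}{2\sigma^2}\right) d\sigma. \tag{7}$$
If $n = 2$, and we put $x - \bar x = s' \tan\phi$, we get
$$P(d\phi \mid x_1, x_2, H) = \frac{1}{\pi}\, d\phi. \tag{8}$$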

But in this case $s'$ is simply the distance of either observation from the
mean, and the values $\phi = \mp\tfrac14\pi$ give, respectively, $x = x_1$ and $x = x_2$. Hence
$$P(x_1 < x < x_2 \mid x_1, x_2, H) = \tfrac12. \tag{9}$$

That is, given just two observations the true value is as likely as not to 
lie between them. This is a general result for any type of law that 
involves only a location and a scale parameter, both of which are 
initially unknown. The latter condition is necessary. If, for instance, 
H contained information about the standard error, and the first two 
observations differed by $4\sigma$, there would be a high probability, given
these observations, that the true value was about midway between
them, and then the probability that the true value was between them
would be more than $\tfrac12$. If they differed by a small fraction of $\sigma$, on the other hand, we
should interpret this as an accidental agreement and the probability,
given the observations, that the true value lay between them would be
less than $\tfrac12$. It is only when the observations contain the whole of the
information available about $\sigma$ that the probability, given them, that
the true value lies between them can be the same for all possible
separations of the observations.

If $n = 1$, $\bar x = x_1$ and $s' = 0$. Then returning to (2)
$$P(dx\, d\sigma \mid x_1, H) \propto \sigma^{-2} \exp\left\{-\frac{(x - x_1)^2}{2\sigma^2}\right\} dx\, d\sigma. \tag{10}$$
Integrating with regard to $\sigma$ we get
$$P(dx \mid x_1, H) \propto \frac{dx}{|x - x_1|}, \tag{11}$$
that is, the most probable value of $x$ is $x_1$, but we have no information
about the accuracy of the determination. (7) gives for $\sigma$
$$P(d\sigma \mid x_1, H) \propto d\sigma/\sigma, \tag{12}$$
that is, we still know nothing about $\sigma$. These results were to be expected,
but attention to degenerate cases is often desirable to verify that the 
solution does degenerate in the right way. 

It is easy to show that, with the distribution of probability given in
(6), the expectation of $(x - \bar x)^2$ is, for $n > 3$,
$$\frac{s'^2}{n-3} = \frac{\sum (x_r - \bar x)^2}{n(n-3)}, \tag{13}$$
and is infinite if $n$ is less than 4. At small numbers of observations the
departure of the posterior probability from normality is great.

There is, however, the following peculiarity if two sets of observations
with different standard errors $\sigma$, $\tau$ are relevant to the same $x$, the
second set having mean $\bar y$ and mean square deviation $t'^2$.
We should here take
$$P(dx\, d\sigma\, d\tau \mid H) \propto dx\, d\sigma\, d\tau/\sigma\tau,$$
$$P(\theta \mid x, \sigma, \tau, H) \propto \sigma^{-m}\tau^{-n} \exp\left[-\frac{m}{2\sigma^2}\{(x - \bar x)^2 + s'^2\} - \frac{n}{2\tau^2}\{(x - \bar y)^2 + t'^2\}\right].$$
Combining these and integrating with regard to $\sigma$ and $\tau$, we get
$$P(dx \mid \theta H) \propto \{s'^2 + (x - \bar x)^2\}^{-m/2}\{t'^2 + (x - \bar y)^2\}^{-n/2}\, dx, \tag{14}$$
and the expectation of $x^2$ converges even if $m = n = 2$. The integral
needs complicated elliptic functions to express it if $m$ and $n$ are odd,
and in general is not expressible compactly. If $m = 1$, $n = 2$ we find
that the posterior probability has a pole at $\bar x$, but the expectation of
$(x - \bar x)^2$ is infinite; this means that neither very small nor very large
values of $\sigma$ are yet effectively excluded by the data.

If an estimate has standard error $\sigma$, $\sigma^{-2}$ or some number proportional
to it is called the weight. If $x_1, x_2, \dots$ are a set of estimates of $x$, with
weights $w_1, w_2, \dots$, the most probable value of $x$ is given by
$$x \sum w_r = \sum w_r x_r, \tag{15}$$
and if unit weight corresponds to standard error 1, the standard error
of the estimate is $(\sum w_r)^{-1/2}$. This additive property of weight often
makes it convenient to express the standard errors in terms of it. The
standard error, itself, however, has an additive property. If $x_1$ and $x_2$
have independent standard errors $\sigma_1$ and $\sigma_2$ then the standard error of
$x_1 + x_2$ or of $x_1 - x_2$ is $(\sigma_1^2 + \sigma_2^2)^{1/2}$; the corresponding weight is
$w_1 w_2/(w_1 + w_2)$.
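The two additive properties can be sketched numerically as follows. This is a minimal illustration, assuming unit weight corresponds to standard error 1; the function names and the two estimates are hypothetical.

```python
from math import sqrt

def combine_estimates(estimates, std_errors):
    """Weighted mean of independent estimates of the same quantity.

    Each weight is the inverse square of the standard error; weights add,
    and the combined standard error is (sum of weights) ** -0.5.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    mean = sum(w * x for w, x in zip(weights, estimates)) / total_weight
    return mean, 1.0 / sqrt(total_weight)

def weight_of_sum_or_difference(w1, w2):
    """Weight of x1 + x2 or x1 - x2 when the errors are independent."""
    return w1 * w2 / (w1 + w2)

# Hypothetical estimates 10.0 +/- 1.0 and 12.0 +/- 2.0 of the same x.
combined, combined_se = combine_estimates([10.0, 12.0], [1.0, 2.0])
diff_weight = weight_of_sum_or_difference(1.0, 0.25)
```

Here the weights are 1 and 0.25, so the less accurate estimate moves the combined value only a fifth of the way towards itself, while the weight 0.2 of the difference corresponds to the standard error $(1^2 + 2^2)^{1/2}$.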

The usual practice in astronomical and physical work is to multiply
the estimated standard error by 0.6745 and call the result the 'probable
error'. But this multiplication, which has little point even when the
probability considered is normally distributed, is seriously wrong when
uncertainty is estimated from the observations themselves. Writing the
usual estimate of the standard error in the form
$$s_{\bar x} = \left\{\frac{\sum (x_r - \bar x)^2}{n(n-1)}\right\}^{1/2} \tag{16}$$
and
$$t = (x - \bar x)/s_{\bar x}, \tag{17}$$
we find as for 2.8 (21)
$$P(dt \mid \theta H) \propto \left(1 + \frac{t^2}{n-1}\right)^{-n/2} dt, \tag{18}$$
which is not normal. We have already seen that for $n = 2$ the probability
that $x$ is between $\bar x \pm s_{\bar x}$ is $\tfrac12$, so that the probable error in the
sense defined for the normal law is equal to the standard error. For
risks of larger error the difference is greater. $P$ being the probability
of a larger $t$ (positive and negative errors being taken together) we have
the following specimen values, from Fisher's table.


  n \ P      0.5       0.1       0.05      0.01
   2        1.000     6.314    12.706    63.657
   5        0.727     2.132     2.776     4.604
  10        0.703     1.833     2.262     3.250
  20        0.688     1.729     2.093     2.861
   ∞        0.674     1.645     1.960     2.576


The values depart widely from proportionality, and a statement of 
uncertainty based on only a few observations is useless as a guide to 
the risk of large error unless the number of observations is given. 
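For two observations the $t$ rule reduces to the form $(1 + t^2)^{-1}$, whose integral is elementary, so the first row of the table can be checked in closed form. The sketch below assumes only that form; the function names are hypothetical.

```python
from math import atan, pi, tan

def t_limit_two_observations(p_larger):
    """Value of |t| exceeded with probability p_larger when n = 2.

    For n = 2 the density of t is proportional to 1 / (1 + t**2), so
    P(|t| < T) = (2 / pi) * atan(T).
    """
    return tan(0.5 * pi * (1.0 - p_larger))

def prob_between(t):
    """Probability that |t| is below the given value when n = 2."""
    return 2.0 * atan(t) / pi

limits = {p: t_limit_two_observations(p) for p in (0.5, 0.1, 0.05, 0.01)}
half = prob_between(1.0)  # probability that x lies within one s of the mean
```

The value at $P = 0.5$ is exactly 1, which restates the result that with two observations the true value is as likely as not to lie between them.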





In many statements of the results of physical experiments, besides
the omission of explicit statement of the numbers of observations in the
final conclusion, the uncertainties stated are often rounded to one
figure; I have actually seen a 'probable error' given as 0.1, which
might mean anything from 0.05 to 0.15. Suppose then that two estimated
standard errors are both given as 0.1, but one means 0.05 on
20 observations, the other 0.15 on 2 observations; and that we want to
state limits such that there is a probability 0.99 that the true value lies
between them, which we might quite well want to do if much depends
on the answer. The limit in one case would be 0.14, in the other 9.5.
In fact if anybody wants to reduce a good set of observations to
meaninglessness he can hardly do better than to round the uncertainty
to one figure and suppress the number of observations.
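The two limits follow from the values of $t$ at $P = 0.01$, namely 2.861 for 20 observations and 63.657 for 2 observations (standard values of the $t$ table), on taking the stated figures as the estimated standard errors:

$$0.05 \times 2.861 \approx 0.14, \qquad 0.15 \times 63.657 \approx 9.5.$$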

It is generally enough to give two figures in the estimated standard 
error. Karl Pearson usually gave many more figures, often six or seven, 
and statisticians still usually give four, but I consider more than two 
a waste of labour. It is not often that a result of importance depends 
on whether the standard error is 0.95 or 1.05.

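3.42. The following problem is liable to arise in practice. Given one
set of observations derived from the normal law, say $x_1$ to $x_{n_1}$, and no
other information about $x$ and $\sigma$, what is the probability that a new
series of $n_2$ observations will give a mean or a standard deviation in a
particular range? We have, from 2.8(15),
$$P(d\bar x_2\, ds_2' \mid x, \sigma, H) = \frac{n_2^{n_2/2}\, s_2'^{\,n_2-2}}{2^{n_2/2-1}\sqrt{\pi}\,(\tfrac12 n_2 - \tfrac32)!\,\sigma^{n_2}} \exp\left\{-\frac{n_2}{2\sigma^2}(\bar x_2 - x)^2\right\} \exp\left(-\frac{n_2 s_2'^2}{2\sigma^2}\right) d\bar x_2\, ds_2' \tag{1}$$
and from 3.41 (2)
$$P(dx\, d\sigma \mid x_1, \dots, x_{n_1}, H) = \frac{n_1^{n_1/2}\, s_1'^{\,n_1-1}}{2^{n_1/2-1}\sqrt{\pi}\,(\tfrac12 n_1 - \tfrac32)!}\, \sigma^{-n_1-1} \exp\left[-\frac{n_1}{2\sigma^2}\{(x - \bar x_1)^2 + s_1'^2\}\right] dx\, d\sigma, \tag{2}$$
whence
$$P(dx\, d\sigma\, d\bar x_2\, ds_2' \mid x_1, \dots, x_{n_1}, H) = \frac{n_1^{n_1/2} n_2^{n_2/2}\, s_1'^{\,n_1-1} s_2'^{\,n_2-2}}{2^{\frac12 n_1 + \frac12 n_2 - 2}\,\pi\,(\tfrac12 n_1 - \tfrac32)!\,(\tfrac12 n_2 - \tfrac32)!}\, \sigma^{-n_1-n_2-1}$$
$$\times \exp\left[-\frac{n_1}{2\sigma^2}\{(x - \bar x_1)^2 + s_1'^2\} - \frac{n_2}{2\sigma^2}\{(x - \bar x_2)^2 + s_2'^2\}\right] dx\, d\sigma\, d\bar x_2\, ds_2'. \tag{3}$$
But
$$n_1(x - \bar x_1)^2 + n_2(x - \bar x_2)^2 = (n_1 + n_2)\left(x - \frac{n_1\bar x_1 + n_2\bar x_2}{n_1 + n_2}\right)^2 + \frac{n_1 n_2}{n_1 + n_2}(\bar x_1 - \bar x_2)^2 \tag{4}$$
and integration with regard to $x$ gives
$$P(d\sigma\, d\bar x_2\, ds_2' \mid x_1, \dots, x_{n_1}, H) = \frac{n_1^{n_1/2} n_2^{n_2/2}\, s_1'^{\,n_1-1} s_2'^{\,n_2-2}\, \sigma^{-n_1-n_2}}{(n_1 + n_2)^{1/2}\, 2^{\frac12 n_1 + \frac12 n_2 - \frac52}\sqrt{\pi}\,(\tfrac12 n_1 - \tfrac32)!\,(\tfrac12 n_2 - \tfrac32)!}$$
$$\times \exp\left\{-\frac{n_1 n_2}{2(n_1 + n_2)\sigma^2}(\bar x_2 - \bar x_1)^2\right\} \exp\left\{-\frac{1}{2\sigma^2}(n_1 s_1'^2 + n_2 s_2'^2)\right\} d\sigma\, d\bar x_2\, ds_2'. \tag{5}$$
If we now integrate with regard to $\sigma$, a factor
$$\left\{n_1 s_1'^2 + n_2 s_2'^2 + \frac{n_1 n_2}{n_1 + n_2}(\bar x_2 - \bar x_1)^2\right\}^{-\frac12(n_1 + n_2 - 1)} \tag{6}$$
will arise. This does not separate into factors. Hence, given $\bar x_1$ and $s_1'$,
the probability distributions of $\bar x_2$ and $s_2'$ are not independent; though
they are independent given $x$ and $\sigma$. What this means is that if $s_1'$ is
unusually large in comparison with $\sigma$, we shall overestimate the scale,
and this overestimation will affect the estimates of the ranges likely
both for $\bar x_2$ and $s_2'$. But if we are interested in only one of them we can
integrate with regard to the other. Then
$$P(d\sigma\, d\bar x_2 \mid x_1, \dots, x_{n_1}, H) = \frac{n_1^{n_1/2} n_2^{1/2}\, s_1'^{\,n_1-1}}{(n_1 + n_2)^{1/2}\, 2^{\frac12 n_1 - 1}\sqrt{\pi}\,(\tfrac12 n_1 - \tfrac32)!}\, \sigma^{-n_1-1} \exp\left[-\frac{1}{2\sigma^2}\left\{n_1 s_1'^2 + \frac{n_1 n_2}{n_1 + n_2}(\bar x_2 - \bar x_1)^2\right\}\right] d\sigma\, d\bar x_2, \tag{7}$$
$$P(d\bar x_2 \mid x_1, \dots, x_{n_1}, H) = \left(\frac{n_2}{n_1 + n_2}\right)^{1/2} \frac{(\tfrac12 n_1 - 1)!}{\sqrt{\pi}\,(\tfrac12 n_1 - \tfrac32)!}\, \frac{1}{s_1'} \left\{1 + \frac{n_2(\bar x_2 - \bar x_1)^2}{(n_1 + n_2)\, s_1'^2}\right\}^{-\frac12 n_1} d\bar x_2. \tag{8}$$
Also
$$P(d\sigma\, ds_2' \mid x_1, \dots, x_{n_1}, H) = \frac{n_1^{\frac12(n_1-1)} n_2^{\frac12(n_2-1)}\, s_1'^{\,n_1-1} s_2'^{\,n_2-2}}{2^{\frac12 n_1 + \frac12 n_2 - 3}\,(\tfrac12 n_1 - \tfrac32)!\,(\tfrac12 n_2 - \tfrac32)!}\, \sigma^{-n_1-n_2+1} \exp\left\{-\frac{1}{2\sigma^2}(n_1 s_1'^2 + n_2 s_2'^2)\right\} d\sigma\, ds_2', \tag{9}$$
$$P(ds_2' \mid x_1, \dots, x_{n_1}, H) \propto s_2'^{\,n_2-2}\, (n_1 s_1'^2 + n_2 s_2'^2)^{-\frac12(n_1 + n_2 - 2)}\, ds_2', \tag{10}$$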

and on putting 4 = 4/y> ( ^ ^ ) 

we recover the form 2.81 (24), and the z distribution follows. 

3.43. If $\bar x_2, \dots, \bar x_{r+1}$ are the means of $r$ further sets of $n_2$ observations
each, $s_2', \dots, s_{r+1}'$ the corresponding mean square deviations, the rule holds
for each separately and independently. Hence
$$P(d\bar x_2 \dots d\bar x_{r+1} \mid x, \sigma, H) = \left(\frac{n_2}{2\pi\sigma^2}\right)^{r/2} \exp\left\{-\frac{n_2}{2\sigma^2}\sum (\bar x_i - x)^2\right\} d\bar x_2 \dots d\bar x_{r+1}. \tag{12}$$
Now put
$$r\bar X = \sum \bar x_i, \qquad rS^2 = \sum (\bar x_i - \bar X)^2; \tag{13}$$
the exponent becomes
$$-\frac{r n_2}{2\sigma^2}\{(\bar X - x)^2 + S^2\} \tag{14}$$
and (12) is of exactly the form of 2.8(4), with $r$ written for $n$ and
$\sigma/\sqrt{n_2}$ for $\sigma$. Hence (10) and the $z$ rule are adapted immediately to give
the probability distribution of $S$ given $x_1$ to $x_{n_1}$. We need only replace
$n_2$ by $r$ and $s_2'^2$ by $n_2 S^2$.

This form is more closely analogous to the way in which the $z$ rule
is used in agricultural experiments. In them the means of plots with
the same treatment are taken, and the sum of the squares of the
differences between the treatment means and the general mean gives
$rS^2$; $n_2 r S^2$ is called the treatment sum of squares. The differences
not explicable by treatments or other systematic effects are used to
provide $s_1'$. Then, given $s_1'$ and the hypothesis of general randomness,
the method will give the probability distribution of $S$. If the observed
value is such that it would be very unlikely to occur on this hypothesis,
given $s_1'$, then the hypothesis is rejected and the existence of treatment
differences asserted. In Fisher's form $s_2$ would correspond to the
random variation and $s_1$ to the possibly partly or mainly systematic one,
hence his convention that $s_1 > s_2$. It is easy to see that interchanging
$n_1$ and $s_1$ with $n_2$ and $s_2$, and reversing the sign of $z$, leaves 2.81 (26)
unaltered.

These results were obtained by Mr. W. O. Storer in an unpublished
paper, based on a suggestion of mine that the conditions that lead to
the similarity between 'Student's' result and mine seemed to be fulfilled
also in the circumstances considered by Fisher in deriving the $z$ distribution.
Hence I expected that the probability distribution of $\log(s_2/s_1)$,
given one set of observations, would agree exactly with that derived
from Fisher's formula; and Storer found this to be the case.

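3.44. A closely related problem, which will serve as an introduction
to some features of the method of least squares, is where we have to
estimate $m$ unknowns $x_r$ ($r = 1$ to $m$), to each of which a set of $n_r$
measures $x_{ri}$ ($i = 1$ to $n_r$) is relevant. The standard error of one
observation is supposed to be the same in all series. Put, $S$ denoting
summation with regard to $i$, $\sum$ with regard to $r$,
$$n_r \bar x_r = S x_{ri}, \qquad n_r s_r'^2 = S(x_{ri} - \bar x_r)^2. \tag{1}$$
Then, denoting the observations collectively by $\theta$,
$$P(dx_1 \dots dx_m\, d\sigma \mid H) \propto dx_1 \dots dx_m\, d\sigma/\sigma, \tag{2}$$
$$P(\theta \mid x_1 \dots x_m \sigma H) \propto \sigma^{-\sum n_r} \exp\left[-\sum \frac{n_r}{2\sigma^2}\{(x_r - \bar x_r)^2 + s_r'^2\}\right], \tag{3}$$
$$P(dx_1 \dots dx_m\, d\sigma \mid \theta H) \propto \sigma^{-\sum n_r - 1} \exp\left[-\sum \frac{n_r}{2\sigma^2}\{(x_r - \bar x_r)^2 + s_r'^2\}\right] dx_1 \dots dx_m\, d\sigma. \tag{4}$$
By integration
$$P(d\sigma \mid \theta H) \propto \sigma^{-\sum n_r + m - 1} \exp\left(-\frac{\sum n_r s_r'^2}{2\sigma^2}\right) d\sigma \tag{5}$$
and if we now put
$$\left(\sum n_r - m\right) s^2 = \sum n_r s_r'^2, \tag{6}$$
$$P(d\sigma \mid \theta H) \propto \sigma^{-(\sum n_r - m) - 1} \exp\left\{-\frac{(\sum n_r - m)\, s^2}{2\sigma^2}\right\} d\sigma. \tag{7}$$
This is of the same form as 3.41 (7), if in the latter we replace $ns'^2$ by
$(n - 1)s^2$ and then replace $n - 1$ by $\sum n_r - m$. In the former problem
$n - 1$, in the present one $\sum n_r - m$, is the difference between the whole
number of observations and the number of true values to be estimated.
Hence it is convenient to call this difference the number of degrees of
freedom and to denote it by $\nu$, and to give the name standard deviation
to $s$ in both cases. Then however many unknowns are estimated we
always have
$$P(d\sigma \mid \theta H) \propto \sigma^{-(\nu+1)} \exp\left(-\frac{\nu s^2}{2\sigma^2}\right) d\sigma, \tag{8}$$
and the posterior probability distribution of $\sigma/s$ is given by a single
set of tables.

(4) can now be written
$$P(dx_1 \dots dx_m\, d\sigma \mid \theta H) \propto \sigma^{-\sum n_r - 1} \exp\left\{-\frac{\nu s^2}{2\sigma^2} - \sum \frac{n_r}{2\sigma^2}(x_r - \bar x_r)^2\right\} \prod dx_r\, d\sigma. \tag{9}$$
Integrate with respect to $x_2, \dots, x_m$; then
$$P(dx_1\, d\sigma \mid \theta H) \propto \sigma^{-\sum n_r + m - 2} \exp\left\{-\frac{n_1}{2\sigma^2}(x_1 - \bar x_1)^2 - \frac{\nu s^2}{2\sigma^2}\right\} dx_1\, d\sigma. \tag{10}$$
Put
$$s_{\bar x_1}^2 = s^2/n_1; \tag{11}$$
then
$$P(dx_1 \mid \theta H) \propto \{\nu s^2 + n_1(x_1 - \bar x_1)^2\}^{-\frac12(\nu+1)}\, dx_1 \tag{12}$$
$$\propto \left\{1 + \frac{(x_1 - \bar x_1)^2}{\nu\, s_{\bar x_1}^2}\right\}^{-\frac12(\nu+1)} dx_1. \tag{13}$$
Hence the posterior probability of $x_1$ follows the $t$ rule with $\nu$ degrees
of freedom, where
$$t = (x_1 - \bar x_1)/s_{\bar x_1}. \tag{14}$$
$s_{\bar x_1}$ is related to $s$ in the same way as the standard error of $\bar x_1$ would be
to that of one observation if the latter was known accurately. Hence
it is convenient to quote $s_{\bar x_1}$ as the standard error of $x_1$. $\bar x_1$, $s_{\bar x_1}$, and $\nu$
are enough to specify the posterior probability of $x_1$ completely, while
$s$ and $\nu$ give that of $\sigma$ completely.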

The situation considered is a common one in practice. A large number 
of unknowns may have to be estimated, but the number of observations 
directly relating to any one may be small. The estimate of any unknown 
from the observations directly relating to it may be of very doubtful 
accuracy on account of the small number of degrees of freedom. But 
if the standard error may be assumed the same for observations of all 
sets the number of degrees of freedom is much increased and a good 
determination of accuracy becomes possible. 

As an example we take Bullard's observations of gravity† in East
Africa. Seven stations were visited twice or more, many others only
once. Those visited more than once were as follows:

  Station      g (cm./sec.²)    Mean        Residual (10⁻⁴ cm./sec.²)
  Nakuru       977.4810         977.4805     +5
                  .4800                      -5
  Kisumu       977.6056         977.6050     +6
                  .6045                      -5
  Equator      977.2608         977.2605     +3
                  .2602                      -3
  Mombasa      977.0212         977.0227    -15
                  .0242                     +15
  Jinja        977.7186                      +4
                  .7176         977.7182     -6
                  .7183                      +1
  Nairobi      977.5289                      -3
                  .5307         977.5292    +15
                  .5281                     -11
  Naivasha     977.4663         977.4679    -16
                  .4695                     +16

The sum of squares of residuals is 1499; $\nu = 16 - 7 = 9$; hence
$$10^4 s = (1499/9)^{1/2} = 12.9.$$
Then
$$s_{\bar x} = 0.00129\left(1, \frac{1}{\sqrt 2}, \frac{1}{\sqrt 3}\right) \text{cm./sec.}^2 = (0.0013,\ 0.00091,\ 0.00074)\ \text{cm./sec.}^2$$
according as the number of measures at a station was 1, 2, or 3; in
each case based on 9 d.f.
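The arithmetic of this example can be retraced with a short sketch. It takes the residuals as tabulated (in units of 10⁻⁴ cm./sec.²) and pools them over all stations; the variable names are hypothetical.

```python
from math import sqrt

# Residuals from the station means, in units of 1e-4 cm/sec^2, as tabulated.
residuals = {
    "Nakuru": [5, -5],
    "Kisumu": [6, -5],
    "Equator": [3, -3],
    "Mombasa": [-15, 15],
    "Jinja": [4, -6, 1],
    "Nairobi": [-3, 15, -11],
    "Naivasha": [-16, 16],
}

n_obs = sum(len(r) for r in residuals.values())         # 16 observations
nu = n_obs - len(residuals)                             # 16 - 7 = 9 d.f.
sum_sq = sum(e * e for r in residuals.values() for e in r)
s = sqrt(sum_sq / nu)                                   # pooled s, units 1e-4

# Standard error of a station value based on k measures, in cm/sec^2.
std_errors = {k: 1e-4 * s / sqrt(k) for k in (1, 2, 3)}
```

No single station supplies more than two degrees of freedom by itself; pooling the residuals is what allows every station, including those visited once, to be quoted with an uncertainty based on 9 degrees of freedom.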


3.5. The method of least squares. This is the extension of the 
problem of estimation, given the normal law of error, to the case where 

† Phil. Trans. A, 235, 1936, 445-531.


several unknowns besides, usually, the standard error need to be found.
If the unknowns are $x_i$, $m$ in number, and a measure is $c_r$, then if there
were no random error we should have a set of relations of the form
$$c_r = f_r(x_1, x_2, \dots, x_m). \tag{1}$$
Actually, on account of the random error, this must be replaced by
$$P(dc_r \mid x_i, \sigma, H) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{(c_r - f_r)^2}{2\sigma^2}\right\} dc_r, \tag{2}$$
and if there are $n$ observations whose errors are independent we can
denote them collectively by $\theta$ and write
$$P(\theta \mid x_i, \sigma, H) = \frac{1}{(2\pi)^{n/2}\sigma^n} \exp\left\{-\frac{1}{2\sigma^2} S(c_r - f_r)^2\right\} dc_1 \dots dc_n, \tag{3}$$
$S$ denoting summation over the observations. Usually the functions $f_r$
are either linear, or else we can find an approximate set of values of
the $x_i$, say $x_{i0}$, and treat the actual $x_i$ as the sum of $x_{i0}$ and a small
departure $x_i'$. In the latter case we can take $x_i'$ as a new set of unknowns,
so that within the permitted range $\partial f_r/\partial x_i$ can be treated as constants.
In either case we can write
$$W = \tfrac12 S(c_r - f_r)^2, \tag{4}$$
which will be a quadratic function of $x_i$ or of $x_i'$. The accent can now
be omitted. We can also write
$$f_r = \sum a_{ir} x_i, \tag{5}$$
$\sum$ denoting summation over the unknowns; but we can shorten the
writing by using the summation convention that when a suffix $i$ is
repeated it is to be given all values from 1 to $m$ and the results added.
To avoid confusion through a suffix occurring more than twice we now
write
$$W = \tfrac12 S(a_{ir}x_i - c_r)(a_{jr}x_j - c_r) \tag{6}$$
$$= \tfrac12 S(a_{ir}a_{jr}x_i x_j - 2a_{ir}c_r x_i + c_r^2) \tag{7}$$
$$= \tfrac12 b_{ij}x_i x_j - d_i x_i + \tfrac12 S c_r^2. \tag{8}$$
In the first sum each pair of unequal suffixes occurs twice, since either
may be called $i$ and the other $j$. There is always a set of values of $x_i$
that make $W$ a minimum. If we call these $y_i$, differentiate with regard
to $x_i$, and put $y_i$ for $x_i$, we have $m$ equations
$$b_{ij}y_j - d_i = 0. \tag{9}$$
These are called the normal equations. They have a unique solution
if $m \le n$ and the determinant formed by the $b_{ij}$ is not zero. Put
$$c_r - a_{ir}y_i = c_r', \qquad x_i - y_i = z_i. \tag{10}$$
Then $W$ is quadratic in $z_i$, and its first derivatives with regard to
all $z_i$ vanish when the $z_i$ are 0. Also $W$ is then equal to $\tfrac12 S c_r'^2$. Hence
$$W = \tfrac12 b_{ij}z_i z_j + \tfrac12 S c_r'^2. \tag{11}$$

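Also $b_{ij}z_i z_j$ is essentially positive because it is equal to $S(a_{ir}z_i)^2$; and
it can be reduced to the sum of $m$ squares of linear functions in an
infinite number of ways. The most convenient is illustrated most easily
by the case of three unknowns. Suppose
$$F = b_{11}z_1^2 + 2b_{12}z_1 z_2 + b_{22}z_2^2 + 2b_{13}z_1 z_3 + 2b_{23}z_2 z_3 + b_{33}z_3^2. \tag{12}$$
Take
$$\zeta_1 = z_1 + \frac{b_{12}}{b_{11}}z_2 + \frac{b_{13}}{b_{11}}z_3. \tag{13}$$
Then
$$F - b_{11}\zeta_1^2 = \left(b_{22} - \frac{b_{12}^2}{b_{11}}\right)z_2^2 + 2\left(b_{23} - \frac{b_{12}b_{13}}{b_{11}}\right)z_2 z_3 + \left(b_{33} - \frac{b_{13}^2}{b_{11}}\right)z_3^2 = b_{22}'z_2^2 + 2b_{23}'z_2 z_3 + b_{33}'z_3^2. \tag{14}$$
Now put
$$\zeta_2 = z_2 + \frac{b_{23}'}{b_{22}'}z_3, \tag{15}$$
$$F - b_{11}\zeta_1^2 - b_{22}'\zeta_2^2 = \left(b_{33}' - \frac{b_{23}'^2}{b_{22}'}\right)z_3^2 = b_{33}''z_3^2. \tag{16}$$
The process can evidently be extended to any number of unknowns; we
denote the coefficients of the successive squares, $b_{11}, b_{22}', b_{33}'', \dots$, by
$b_1, b_2, b_3, \dots$.
First suppose that $\sigma$ is known, and take the prior probabilities of
$x_i$ uniformly distributed. Then
$$P(dx_1\, dx_2 \dots dx_m \mid \sigma, H) \propto dx_1 \dots dx_m. \tag{17}$$
$$P(dx_1 \dots dx_m \mid \theta, \sigma, H) \propto \sigma^{-n} \exp\left(-\frac{W}{\sigma^2}\right) dx_1 \dots dx_m \propto \sigma^{-n} \exp\left\{-\frac{1}{2\sigma^2}(b_{ij}z_i z_j + S c_r'^2)\right\} dx_1 \dots dx_m. \tag{18}$$
But by the mode of formation of the $\zeta_i$ we see that in the Jacobian
$\partial(\zeta_1, \dots, \zeta_m)/\partial(z_1, \dots, z_m)$ all terms in the leading diagonal are 1, and all those to one
side of it are 0. Hence the Jacobian is 1, and we have the form
$$P(d\zeta_1 \dots d\zeta_m \mid \theta, \sigma, H) \propto \sigma^{-n} \exp\left\{-\frac{1}{2\sigma^2}\left(\sum b_i \zeta_i^2 + S c_r'^2\right)\right\} d\zeta_1 \dots d\zeta_m. \tag{19}$$
This breaks up into factors, and we can say for any $\zeta_i$ separately
$$P(d\zeta_i \mid \theta, \sigma, H) = \left(\frac{b_i}{2\pi\sigma^2}\right)^{1/2} \exp\left(-\frac{b_i \zeta_i^2}{2\sigma^2}\right) d\zeta_i. \tag{20}$$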


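In particular, since $\zeta_m = z_m$, we shall be able to write
$$x_m = y_m + z_m = y_m \pm \sigma/\sqrt{b_m}. \tag{21}$$
$b_m$ can be identified easily, for if we write
$$D = \|b_{ij}\| \tag{22}$$
for the determinant of all the $b_{ij}$, and $B_{mm}$ for the minor of $b_{mm}$ in it,
the transformation alters neither $D$ nor $B_{mm}$, since
$$\frac{\partial(\zeta_1, \dots, \zeta_m)}{\partial(z_1, \dots, z_m)} = 1, \qquad \frac{\partial(\zeta_1, \dots, \zeta_{m-1})}{\partial(z_1, \dots, z_{m-1})} = 1, \tag{23}$$
and therefore
$$b_m = D/B_{mm}. \tag{24}$$
Any other function of the $x_i$ can be estimated as follows. Let
$$\xi = l_i x_i = l_i y_i + l_i z_i, \tag{25}$$
where the $l_i$ are specified. Then we can eliminate the $z_i$ in favour of
the $\zeta_i$, and get
$$\xi = l_i y_i + \sum \lambda_i \zeta_i, \tag{26}$$
where the probability of $\zeta_i$ is distributed about 0 with standard error
$\sigma/\sqrt{b_i}$. Hence that of $\xi$ is distributed about $l_i y_i$ with standard error
$\sigma(\xi)$ given by
$$\sigma^2(\xi) = \sigma^2 \sum (\lambda_i^2/b_i). \tag{27}$$
If $\sigma$ is unknown we must replace (17) by
$$P(dx_1 \dots dx_m\, d\sigma \mid H) \propto dx_1 \dots dx_m\, d\sigma/\sigma \tag{28}$$
and (19) by
$$P(d\zeta_1 \dots d\zeta_m\, d\sigma \mid \theta H) \propto \sigma^{-n-1} \exp\left\{-\frac{1}{2\sigma^2}\left(\sum b_i \zeta_i^2 + S c_r'^2\right)\right\} d\zeta_1 \dots d\zeta_m\, d\sigma. \tag{29}$$
Integrating with respect to all the $\zeta_i$ except $\zeta_m$ we have
$$P(d\zeta_m\, d\sigma \mid \theta H) \propto \sigma^{-n+m-2} \exp\left\{-\frac{1}{2\sigma^2}(b_m \zeta_m^2 + S c_r'^2)\right\} d\zeta_m\, d\sigma, \tag{30}$$
and then integrating with regard to $\sigma$,
$$P(d\zeta_m \mid c_1 \dots c_n H) \propto (b_m \zeta_m^2 + S c_r'^2)^{-\frac12(n-m+1)}\, d\zeta_m, \tag{31}$$
so that the posterior probability of $\zeta_m$ is distributed as for $t$ with $n - m$
degrees of freedom. It is easily seen that the same applies to any
linear function of the $\zeta_i$. If $n - m$ is large the distribution becomes
approximately normal with standard error $\sigma(\zeta_m)$ given by
$$\sigma^2(\zeta_m) = \frac{S c_r'^2}{(n - m)\, b_m} = \frac{B_{mm}\, S c_r'^2}{(n - m)\, D}.$$
This is the same as the form taken by (21) if we replace $\sigma^2$ by
$S c_r'^2/(n - m)$.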

The practical method of solution is as follows. We start with the
$n$ equations
$$a_{ir}x_i = c_r, \tag{33}$$
which are called the equations of condition. In general no set of values
of $x_i$ will satisfy them all exactly. But if we multiply each equation
by $a_{jr}$ and sum for all values of $r$, we obtain the equation
$$b_{ij}x_i = d_j \tag{34}$$
by the definitions of $b_{ij}$ and $d_j$. This is done for all values of $j$ from 1
to $m$, and yields $m$ equations for $x_i$. These are the normal equations.
Their solution as simultaneous equations is
$$x_i = y_i. \tag{35}$$
The most convenient process of solution is identical with that of finding
the $\zeta_i$. For if we divide the first equation by $b_{11}$ the function on the left is
$$x_1 + \frac{b_{12}}{b_{11}}x_2 + \frac{b_{13}}{b_{11}}x_3 + \dots$$
Multiplying this in turn by $b_{12}, b_{13}, \dots$ and subtracting from all the others,
we eliminate $x_1$ from all. Thus we are left with $m - 1$ equations, which
still have the property that the coefficient of $x_j$ in the equation for $x_i$ is
equal to that of $x_i$ in the equation for $x_j$; for both are equal to
$$b_{ij} - b_{1i}b_{1j}/b_{11}.$$
We can therefore proceed to eliminate all in turn, finishing with $x_m$,
the coefficient of which will be $b_m$, and $b_m$ is therefore yielded automatically.
Any other coefficient $b_i$ is the coefficient of $x_i$ in the first
equation remaining when $x_1$ to $x_{i-1}$ have been eliminated. Thus the
process of solution yields all the $b_i$. If $\sigma$ is initially known, all that
remains is to express any unknown, say $x_1$, in the form $d_1/b_1 \pm \sigma/\sqrt{b_1}$ +
a linear function of $y_2$ to $y_m$ and of $\zeta_2$ to $\zeta_m$; in this we use the second
equation to replace $y_2$ by a constant $d_2'/b_2$ with functions of $y_3$ to $y_m$
and of $\zeta_3$ to $\zeta_m$; and so on. Thus finally we obtain the value of $y_1$, which
is the most probable value of $x_1$, and a set of independent uncertainties
of $x_1$, which are easily combined.

If $\sigma$ is initially unknown we proceed to estimate the $y_i$ as before;
then substituting in the equations of condition we obtain the set of
differences $c_r - a_{ir}y_i$, which are called the residuals, and are identical
with $c_r'$. Then we can define the standard deviation of one observation by
$$(n - m)s^2 = S c_r'^2, \tag{37}$$
and that of $\zeta_m$ by
$$s(\zeta_m) = s/\sqrt{b_m}. \tag{38}$$
Put $t = \zeta_m/s(\zeta_m)$ and we have
$$P(d\zeta_m \mid c_1 \dots c_n H) \propto \left(1 + \frac{t^2}{n - m}\right)^{-\frac12(n-m+1)} dt, \tag{39}$$
which is of exactly the same form as 3.44 (13). If $n - m = \nu$, $\nu$ is again
the number of degrees of freedom, and the $t$ table can be used as in the
simpler cases.
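The whole procedure, from equations of condition to a standard error on $n - m$ degrees of freedom, can be sketched for the simplest case of two unknowns, a straight line $c = x_1 + x_2 t$. The data and the function name are hypothetical; the steps follow the elimination described above, with $a_{1r} = 1$ and $a_{2r} = t_r$.

```python
from math import sqrt

def fit_line(ts, cs):
    """Least-squares fit of c = x1 + x2 * t via the normal equations."""
    n = len(ts)
    # Normal-equation coefficients b_ij = S a_ir a_jr and d_i = S a_ir c_r.
    b11, b12, b22 = float(n), sum(ts), sum(t * t for t in ts)
    d1, d2 = sum(cs), sum(t * c for t, c in zip(ts, cs))
    # Eliminate x1; b2 is the coefficient of x2 in the remaining equation.
    b2 = b22 - b12 * b12 / b11
    d2_reduced = d2 - b12 * d1 / b11
    y2 = d2_reduced / b2
    y1 = (d1 - b12 * y2) / b11
    # Residuals and the standard deviation of one observation, n - 2 d.f.
    residuals = [c - (y1 + y2 * t) for t, c in zip(ts, cs)]
    s = sqrt(sum(e * e for e in residuals) / (n - 2))
    return y1, y2, s, s / sqrt(b2)

# Hypothetical measures at four values of t.
intercept, slope, s, slope_se = fit_line([0.0, 1.0, 2.0, 3.0],
                                         [1.1, 2.9, 5.2, 6.8])
```

The last value returned is $s/\sqrt{b_m}$ for the unknown eliminated last, here the slope; with four observations and two unknowns it is to be used with the $t$ table on 2 degrees of freedom.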

This method (essentially Gauss’s method of substitution) has great 
advantages over some of those usually given, which involve the working 
out of $m+1$ determinants of the $m$th order to obtain the $y_i$, and the
evaluation of the first minors of all terms in the leading diagonal of $D$ to
find the standard errors of the $y_i$. Personally I find that to get the right
value for a determinant above the third order is usually beyond my 
powers, but the above process usually gives me the right answer. The 
symmetry of the equations at each stage of the solution gives a useful 
check on the arithmetic, and the correctness of the final solution can 
be checked by substitution. 

A method due to Laplace is often said to be independent of the normal 
law; but it assumes that the ‘best’ estimate is a linear function of the 
observations, and if there was only one unknown this would imply by 
symmetry the postulate of the arithmetic mean, which in turn implies 
the normal law. Further, it assumes that the error is estimated by the 
expectation of its square, which is justified by the normal law but has 
to be taken as a separate (and wrong) postulate otherwise; and an 
unnecessary appeal to Bernoulli's theorem has to be made.†

3.51. To illustrate the method of solution, consider the following set
of normal equations (1), (2), (3); the standard deviation of one observation
is $s$.

      12x - 5y + 4z = 2                        (1)
      -5x + 8y + 2z = 1                        (2)
       4x + 2y + 6z = 5                        (3)
       x - 0.42y + 0.33z = +0.17 ± 0.28s       (4)
      5x - 2.1y + 1.7z = +0.8                  (5)
      4x - 1.7y + 1.3z = +0.7                  (6)
           5.9y + 3.7z = +1.8                  (7)
           3.7y + 4.7z = +4.3                  (8)
             y + 0.63z = +0.31 ± 0.41s         (9)
           3.7y + 2.3z = +1.1                  (10)
                  2.4z = +3.2                  (11)
                     z = +1.33 ± 0.64s         (12)
       y = +0.31 - 0.63 × 1.33 = -0.53         (13)
       x = +0.17 - 0.42 × 0.53 - 0.33 × 1.33 = -0.49    (14)

(4) is got by dividing (1) by 12; (5), (6) by multiplying (4) by 5 and 4.
Then (2) and (5) give (7), and so on. These results should be checked


† Cf. Phil. Mag. 22, 1936, 337-69.




by substitution in the original equations. The standard error $0.28s$ in
the first line is $s/\sqrt{12}$, and similarly for the others. For $\delta y$ we have
$$\delta y = \pm 0.41s \pm 0.63 \times 0.64s = (\pm 0.41 \pm 0.41)s = \pm 0.58s, \tag{15}$$
and for $\delta x$
$$x = (x - 0.42y + 0.33z) + 0.42(y + 0.63z) - 0.60z; \tag{16}$$
$$\delta x = (\pm 0.28 \pm 0.42 \times 0.41 \pm 0.60 \times 0.64)s; \tag{17}$$
$$s_x^2 = 0.295s^2, \qquad s_x = 0.54s. \tag{18}$$
Hence
$$x = -0.49 \pm 0.54s; \qquad y = -0.53 \pm 0.58s; \qquad z = +1.33 \pm 0.64s. \tag{19}$$
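The same three normal equations can be solved at full precision as a check on the hand computation. The sketch below is a plain elimination without pivoting, in the spirit of the method of substitution; `gauss_solve` is a hypothetical helper, and the standard-error factors are obtained from the diagonal of the inverse matrix rather than by combining the separate uncertainties, so small differences from the two-figure values above are to be expected.

```python
from math import sqrt

def gauss_solve(B, d):
    """Solve B y = d by successive elimination; return (y, pivots).

    The pivots are the coefficients b_1, b_2, ... left on the diagonal
    as each unknown is eliminated in turn.
    """
    m = len(d)
    A = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(B, d)]
    pivots = []
    for i in range(m):
        pivots.append(A[i][i])
        for k in range(i + 1, m):
            f = A[k][i] / A[i][i]
            for j in range(i, m + 1):
                A[k][j] -= f * A[i][j]
    y = [0.0] * m
    for i in reversed(range(m)):
        acc = A[i][m] - sum(A[i][j] * y[j] for j in range(i + 1, m))
        y[i] = acc / A[i][i]
    return y, pivots

B = [[12.0, -5.0, 4.0], [-5.0, 8.0, 2.0], [4.0, 2.0, 6.0]]
d = [2.0, 1.0, 5.0]
solution, pivots = gauss_solve(B, d)

# Standard error of each unknown in units of s: square root of the matching
# diagonal element of the inverse of B, found by solving B u = e_i.
se_factors = []
for i in range(3):
    unit = [1.0 if k == i else 0.0 for k in range(3)]
    u, _ = gauss_solve(B, unit)
    se_factors.append(sqrt(u[i]))
```

The last pivot reproduces the coefficient 2.4 of equation (11), and its inverse square root gives the factor 0.64 attached to $z$.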

3.52. Equations of condition of unequal weights; Grouping.
In the argument of 3.5 we have assumed that every measure has the
same standard error. If the standard errors are unequal, 3.5 (3) will be
replaced by
$$P(\theta \mid x_i, \sigma_r, H) = (2\pi)^{-n/2} \prod_r \sigma_r^{-1} \exp\left\{-S\frac{(c_r - f_r)^2}{2\sigma_r^2}\right\} \prod_r dc_r, \tag{1}$$
and the exponent is still a quadratic form. It differs from $W$ in so far
as each term of the sum has to be divided by $\sigma_r^2$ before addition.
Consequently the quantities $\sigma_r^{-2}$, or their products by a convenient
constant, are called the weights of the equations of condition. It will
be noticed that (1) is the same as if we replaced the equations
$$a_{ir}x_i = c_r \pm \sigma_r \tag{2}$$
by
$$\frac{\sigma}{\sigma_r}a_{ir}x_i = \frac{\sigma}{\sigma_r}c_r \pm \sigma \tag{3}$$
and took each observation as one of $(\sigma/\sigma_r)c_r$ with the same uncertainty $\sigma$.
If the $\sigma_r$ are known and $\sigma$ is chosen conveniently the formation and
solution of the normal equations will proceed exactly as before.
Evidently the arbitrary $\sigma$ will cancel in the course of the work. This
procedure is convenient as an aid to seeing that the method needs only
a slight alteration at the outset, and is sometimes recommended as a
practical method; that is, it is proposed that the whole of the equations
of condition should be multiplied by their respective $\sigma/\sigma_r$ before forming
the normal equations. This has the disadvantage that the weights are
often integers and the multiplication brings in square roots and consequent
additional rounding-off errors. It is better to proceed as follows.
If
$$W = \tfrac12 S\frac{\sigma^2}{\sigma_r^2}(a_{ir}x_i - c_r)^2, \tag{4}$$
$W$ is also equal to
$$\tfrac12 S\,(a_{ir}x_i - c_r)\left\{\frac{\sigma^2}{\sigma_r^2}(a_{jr}x_j - c_r)\right\}, \tag{5}$$
and
$$\frac{\partial W}{\partial x_i} = S\,a_{ir}\left\{\frac{\sigma^2}{\sigma_r^2}(a_{jr}x_j - c_r)\right\}. \tag{6}$$
Consequently, if we first multiply every equation of condition by its
weight and then form the normal equations by multiplying by $a_{ir}$
and adding, we get the same equations with less trouble and more
accuracy.
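A minimal sketch of this recipe follows: each equation of condition is multiplied by its weight, then by the coefficient of the unknown in question, and the results are added. The function name, the layout `A[r][i]` for the coefficient of unknown $i$ in equation $r$, and the data are all hypothetical.

```python
def weighted_normal_equations(A, c, w):
    """Form b_ij = S w_r a_ir a_jr and d_i = S w_r a_ir c_r.

    A[r][i] is the coefficient of unknown i in equation of condition r,
    c[r] the measure and w[r] its weight. No square roots are needed.
    """
    m = len(A[0])
    B = [[sum(wr * row[i] * row[j] for row, wr in zip(A, w))
          for j in range(m)] for i in range(m)]
    d = [sum(wr * row[i] * cr for row, cr, wr in zip(A, c, w))
         for i in range(m)]
    return B, d

# Hypothetical case: three measures of one unknown with weights 1, 2, 3.
B, d = weighted_normal_equations([[1.0], [1.0], [1.0]],
                                 [10.0, 11.0, 12.0],
                                 [1.0, 2.0, 3.0])
estimate = d[0] / B[0][0]
```

With a single unknown the normal equation reduces to the weighted mean, and the coefficient `B[0][0]` is the total weight 6, so that the standard error of the estimate is that of unit weight divided by $\sqrt 6$.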

If the $\sigma_r$ are unknown and some of them mutually irrelevant there
will be a complication similar to that of 3.41 (14). But it often happens
in a programme of observation that some observations are recorded
as made in specially favourable conditions, some moderate, and some
poor. It is usual to deal with this by attaching impressions of the
relative accuracy in the form of weights, somewhat arbitrarily, though
a determination of the accuracy of observations in the various grades
would be possible if the residuals were classified. Our problem, if the
relative accuracies are accepted, is to obtain an estimate of accuracy
when the $\sigma_r$ are not taken as known, but their ratios are taken as known.
We take $\sigma$ as the standard error corresponding to unit weight and
proceed as just described. If $w_r = \sigma^2/\sigma_r^2$ is the weight of the $r$th observation
the term $Sc_r'^2$ in 3.5 (29) will be replaced by $Sw_r c_r'^2$. The only change
in the method of estimating $\sigma$ is therefore that in forming $s^2$ as in 3.5
(37) we must multiply each $c_r'^2$ by the weight of the observation.

The observations often fall into groups such that within each group
all the $a_{ir}$ are nearly the same. The extreme case of this condition is
the problem of 3.44, where for the $i$th station $a_{ir} = 1$ if the observation
is at that station and 0 if it is at any other. In the determination of
an earthquake epicentre from the times of arrival of a phase at different
stations, the stations fall into geographical regions such that within
any region the time of arrival would be altered by nearly the same
amount by any change in the adopted time of occurrence and the
position of the epicentre. It then simplifies the work considerably to
form an equation of condition for the mean position of the stations in
the region and to use the mean $c_r$ for it. The standard error of the latter
will be $\sigma/\sqrt{n_r}$, where $n_r$ is the number of stations in the region, and
therefore it supplies an equation of condition of weight $n_r$. The normal
equations will be nearly the same as if all the stations were used to
form separate equations of condition. All the residuals are still available
to provide an estimate of $\sigma$, which will be on $n-m$ degrees of freedom
just as in the treatment without grouping. If we chose to use the
method described in the last paragraph we should get the same least
squares solution, but only the mean residuals in the groups would be
available to provide an estimate of uncertainty, which would therefore
be on many fewer degrees of freedom.

3.53. The following data, given by E. C. Bullard and H. L. P. Jolly,†
provide a more complicated instance of the method. The unknowns
are the values of gravity at various places. In general gravity is not
measured absolutely, but the difference between the periods of the same
pendulum when swung in different places is found, thus giving an
estimate of the difference of gravity. This is referred to a standard value
for Potsdam, where an absolute determination exists. In the following
set of equations of condition, therefore, absolute values refer to stations
compared directly with Potsdam; the rest are differences. Bullard and
Jolly took De Bilt as given, but it appears that the comparison of De
Bilt with Potsdam has an appreciable uncertainty compared with those
of some of the English stations, and it seems best to treat it as an
additional unknown. The unknowns are then:

$g_0$, De Bilt.
$g_1$, Greenwich, Record Room.
$g_2$, Greenwich, National Gravity Station.
$g_3$, Kew.
$g_4$, Cambridge, Pendulum House.
$g_5$, Southampton.
The equations of condition are:

Observer                        Date
Putnam                          1900   $g_1 = 981.188$            (1)
Putnam                          1900   $g_3 = 981.200$            (2)
Lenox-Conyngham                 1903   $g_3 - g_1 = +0.014$       (3)
Meinesz                         1925   $g_4 - g_0 = -0.003$       (4)
Lenox-Conyngham and Manley      1925   $g_4 - g_3 = +0.0647$      (5)
Jolly and McCaw                 1927   $g_2 - g_1 = +0.0003$      (6)
Miller                          1928   $g_2 = 981.1888$           (7)
Jolly and Willis                1930   $g_4 - g_2 = +0.0742$      (8)
Willis and Bullard              1931   $g_2 - g_5 = +0.0653$      (9)
Jolly and Bullard               1933   $g_4 - g_5 = +0.1431$      (10)
Bullard                         1935   $g_4 - g_5 = +0.1390$      (11)
Meinesz                         1921   $g_0 = 981.267$            (12)
Meinesz                         1925   $g_0 = 981.269$            (13)

The unit is 1 gal = 1 cm./sec.$^2$

A main source of error is known to be change of the mechanical
properties of the pendulums during transport. Hence all the equations
will be taken of equal weight except (6). For this the stations are only
300 metres apart and at nearly the same height, and the difference can
be calculated more accurately than it can be measured. I take

$$g_2 - g_1 = +0.0001.$$

† M.N.R.A.S., Geophys. Suppl. 3, 1936, 470.

An approximate set of solutions is easily found; we write

$g_0 = 981.268 + x_0$,   (14)
$g_1 = 981.188 + x_1$,   (15)
$g_2 = 981.1881 + x_1$,  (16)
$g_3 = 981.200 + x_3$,   (17)
$g_4 = 981.265 + x_4$,   (18)
$g_5 = 981.123 + x_5$.   (19)

Then the equations of condition, omitting (6), become

$x_1 = 0.0000$,          (1')
$x_3 = 0.0000$,          (2')
$x_3 - x_1 = +0.0020$,   (3')
$x_4 - x_0 = 0.0000$,    (4')
$x_4 - x_3 = -0.0003$,   (5')
$x_1 = +0.0007$,         (7')
$x_4 - x_1 = -0.0027$,   (8')
$x_1 - x_5 = +0.0002$,   (9')
$x_4 - x_5 = +0.0011$,   (10')
$x_4 - x_5 = -0.0030$,   (11')
$x_0 = -0.0010$,         (12')
$x_0 = +0.0010$.         (13')


$x_0$ occurs in equations (4'), (12'), (13'), with coefficients $-1$, $+1$, $+1$.
We therefore add (12') and (13') and subtract (4') to give the normal
equation for $x_0$, namely

$$3x_0 - x_4 = 0.0000.$$

$x_1$ occurs in (1'), (7'), (9') with coefficient $+1$, in (3'), (8') with
coefficient $-1$. We therefore multiply (3'), (8') by $-1$ and add to the
sum of (1'), (7'), (9'). Similarly we proceed for the others.
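The rule just described (multiply each equation of condition by the coefficient of the unknown in question and add) is mechanical enough to sketch in code. The block below is an illustration only: the dictionary encoding of equations (1') to (13') and the function name are assumptions of this sketch, and right-hand sides are entered in units of 0.0001 gal so that the arithmetic stays in integers.

```python
UNKNOWNS = ["x0", "x1", "x3", "x4", "x5"]

# Equations of condition (1')-(13'), omitting (6); right sides in units of 0.0001 gal.
CONDITIONS = [
    ({"x1": 1}, 0),              # (1')
    ({"x3": 1}, 0),              # (2')
    ({"x3": 1, "x1": -1}, 20),   # (3')
    ({"x4": 1, "x0": -1}, 0),    # (4')
    ({"x4": 1, "x3": -1}, -3),   # (5')
    ({"x1": 1}, 7),              # (7')
    ({"x4": 1, "x1": -1}, -27),  # (8')
    ({"x1": 1, "x5": -1}, 2),    # (9')
    ({"x4": 1, "x5": -1}, 11),   # (10')
    ({"x4": 1, "x5": -1}, -30),  # (11')
    ({"x0": 1}, -10),            # (12')
    ({"x0": 1}, 10),             # (13')
]

def form_normal_equations(conditions, unknowns):
    """For each unknown, multiply every equation by its coefficient of that unknown and add."""
    index = {name: k for k, name in enumerate(unknowns)}
    m = len(unknowns)
    matrix = [[0] * m for _ in range(m)]
    rhs = [0] * m
    for coeffs, value in conditions:
        for name_i, a_i in coeffs.items():
            rhs[index[name_i]] += a_i * value
            for name_j, a_j in coeffs.items():
                matrix[index[name_i]][index[name_j]] += a_i * a_j
    return matrix, rhs

NORMAL_MATRIX, NORMAL_RHS = form_normal_equations(CONDITIONS, UNKNOWNS)

# Check on the formation: the absolute equations (1'), (2'), (7'), (12'), (13')
# must carry the same total right side as the normal equations together.
ABSOLUTE_SUM = sum(value for coeffs, value in CONDITIONS if len(coeffs) == 1)
```

The rows of `NORMAL_MATRIX` and `NORMAL_RHS` correspond, in order, to the normal equations (20) to (24).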


Normal equations

$3x_0 - x_4 = 0.0000$                          (20)
$5x_1 - x_3 - x_4 - x_5 = +0.0016$             (21)
$-x_1 + 3x_3 - x_4 = +0.0023$                  (22)
$-x_0 - x_1 - x_3 + 5x_4 - 2x_5 = -0.0049$     (23)
$-x_1 - 2x_4 + 3x_5 = +0.0017$                 (24)

$x_0 - 0.3333x_4 = 0.0000$                     (25)

First divide (20) by 3; the result is (25). To eliminate $x_0$ we have only
to add (25) to (23). Then

$5x_1 - x_3 - x_4 - x_5 = +0.0016$  (21)          $x_1 - 0.2x_3 - 0.2x_4 - 0.2x_5 = +0.00032$  (27)
$-x_1 + 3x_3 - x_4 = +0.0023$  (22)
$-x_1 - x_3 + 4.6667x_4 - 2x_5 = -0.0049$  (26)
$-x_1 - 2x_4 + 3x_5 = +0.0017$  (24)

Now eliminate $x_1$.

$2.8x_3 - 1.2x_4 - 0.2x_5 = +0.00262$  (28)       $x_3 - 0.4286x_4 - 0.0714x_5 = +0.00094$  (31)
$-1.2x_3 + 4.4667x_4 - 2.2x_5 = -0.00458$  (29)   $1.2x_3 - 0.5143x_4 - 0.0857x_5 = +0.00113$  (32)
$-0.2x_3 - 2.2x_4 + 2.8x_5 = +0.00202$  (30)      $0.2x_3 - 0.0857x_4 - 0.0143x_5 = +0.00019$  (33)

$3.9524x_4 - 2.2857x_5 = -0.00345$  (34)          $x_4 - 0.5783x_5 = -0.00087$  (36)
$-2.2857x_4 + 2.7857x_5 = +0.00221$  (35)         $2.2857x_4 - 1.3218x_5 = -0.00200$  (37)

$1.4639x_5 = +0.00021$  (38)                      $x_5 = +0.00014$  (39)

Hence the solution is, from (36), (31), (27), (25) in turn,

$x_0 = -0.00020$
$x_1 = +0.00031$
$x_3 = +0.00061$      (40)
$x_4 = -0.00079$
$x_5 = +0.00014$

Substituting in the normal equations we find that the largest discrepancy
is 5 in the fifth decimal, so that the solution is checked. A check on
the formation of the normal equations is got by noticing that most of
the equations of condition are differences; hence the sum of the right
sides of (1'), (2'), (7'), (12'), (13') should be that of the right sides of (20)
to (24).† Now substituting in the equations of condition we get the
calculated values. Residuals are multiplied by 1000 for convenience.



         Calc.     O-C      c'^2
(1')     +0.31    -0.31     0.10
(2')     +0.61    -0.61     0.37
(3')     +0.30    +1.70     2.89
(4')     -0.59    +0.59     0.35
(5')     -1.40    +1.10     1.21
(7')     +0.31    +0.39     0.15
(8')     -1.10    -1.60     2.56
(9')     +0.17    +0.03     0.01
(10')    -0.93    +2.03     4.12
(11')    -0.93    -2.07     4.28
(12')    -0.20    -0.80     0.64
(13')    -0.20    +1.20     1.44
                           -----
                           18.12


† For general methods of checking when the number of normal equations is large, see
H. and B. S. Jeffreys, Methods of Mathematical Physics, p. 283. Method (2) mentioned
on p. 284 will also check the formation of the normal equations themselves from the
equations of condition.

We have 12 equations and 5 unknowns have been found; hence

$$\sigma^2 = 18.12/(12-5) = 2.59; \qquad \sigma = 1.60 \text{ milligal}. \tag{41}$$
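The elimination and the estimate in (41) can be repeated by machine as a cross-check. What follows is a sketch under stated assumptions: the list encoding of equations (1') to (13') is this sketch's own, right sides are in units of 0.0001 gal (0.1 milligal), and the solver is ordinary Gaussian elimination without row interchanges, which is adequate here because the normal equations are symmetric and positive definite. Agreement with the hand solution should be expected only to within the rounding of the hand work; the value of $x_0$ follows from (20) as $x_4/3$.

```python
import math

# Coefficients of (x0, x1, x3, x4, x5) and right sides (units of 0.0001 gal)
# for equations (1')-(13'), omitting (6).
CONDITIONS = [
    ([0, 1, 0, 0, 0], 0), ([0, 0, 1, 0, 0], 0), ([0, -1, 1, 0, 0], 20),
    ([-1, 0, 0, 1, 0], 0), ([0, 0, -1, 1, 0], -3), ([0, 1, 0, 0, 0], 7),
    ([0, -1, 0, 1, 0], -27), ([0, 1, 0, 0, -1], 2), ([0, 0, 0, 1, -1], 11),
    ([0, 0, 0, 1, -1], -30), ([1, 0, 0, 0, 0], -10), ([1, 0, 0, 0, 0], 10),
]

def normal_equations(conditions):
    m = len(conditions[0][0])
    matrix = [[float(sum(c[i] * c[j] for c, _ in conditions)) for j in range(m)]
              for i in range(m)]
    rhs = [float(sum(c[i] * v for c, v in conditions)) for i in range(m)]
    return matrix, rhs

def solve(matrix, rhs):
    """Gaussian elimination followed by back-substitution (no row interchanges)."""
    n = len(rhs)
    a = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for k in range(n):
        pivot = a[k][k]
        a[k] = [v / pivot for v in a[k]]          # divide the pivot row, as in (25), (27), ...
        for i in range(k + 1, n):
            factor = a[i][k]
            a[i] = [vi - factor * vk for vi, vk in zip(a[i], a[k])]
    x = [0.0] * n
    for k in reversed(range(n)):
        x[k] = a[k][n] - sum(a[k][j] * x[j] for j in range(k + 1, n))
    return x

MATRIX, RHS = normal_equations(CONDITIONS)
X = solve(MATRIX, RHS)                            # units of 0.0001 gal

RESIDUALS = [v - sum(ci * xi for ci, xi in zip(c, X)) for c, v in CONDITIONS]
SUM_SQ_MGAL = sum(r * r for r in RESIDUALS) / 100.0   # (0.1 mgal)^2 -> mgal^2
SIGMA_MGAL = math.sqrt(SUM_SQ_MGAL / (len(CONDITIONS) - len(X)))
```

Multiplying the entries of `X` by 0.0001 gives the corrections in gals; `SIGMA_MGAL` is the standard error of one equation of unit weight on 12 - 5 = 7 degrees of freedom.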

The uncertainties of the separate determinations have still to be
found. Denote departures from the least squares solutions by accents
and take 1 milligal as the unit. Denote, apart from accents, the
quantities on the right of (25), (27), (31), (36), (39) by $X_i$; these have
independent uncertainties. Then

$x_5' = X_5' = \pm 1.60/(1.46)^{1/2} = \pm 1.32$,               (42)
$X_4' = \pm 1.60/(3.95)^{1/2} = \pm 0.80$,                      (43)
$X_3' = \pm 1.60/(2.80)^{1/2} = \pm 0.96$,                      (44)
$X_1' = \pm 1.60/(5.00)^{1/2} = \pm 0.71$,                      (45)
$X_0' = \pm 1.60/(3.00)^{1/2} = \pm 0.92$,                      (46)
$x_4' = X_4' \pm 0.578x_5' = \pm 0.80 \pm 0.76 = \pm 1.10$,     (47)
$x_3' = X_3' \pm 0.4286X_4' \pm 0.177x_5'$                      (48)
$\quad = \pm 0.96 \pm 0.34 \pm 0.23 = \pm 1.04$,                (49)

and so on. The final solution is

$g_0 = 981.26780 \pm 0.00099$,   (50)
$g_1 = 981.18831 \pm 0.00092$,   (51)
$g_2 = 981.18841 \pm 0.00092$,   (52)
$g_3 = 981.20061 \pm 0.00104$,   (53)
$g_4 = 981.26421 \pm 0.00110$,   (54)
$g_5 = 981.12314 \pm 0.00132$.   (55)


From the $t$ table for 7 degrees of freedom we find that the probability
of an error numerically greater than 2 milligals ranges from about
0.07 for $g_1$ and $g_2$ to 0.18 for $g_5$.

The standard errors are not much less than for one determination.
This is ultimately because, of the 12 equations, only 5 represent direct
comparisons with Potsdam. Even if the differences were exactly
determined the standard errors could not be less than
$1.60/\sqrt{5} = 0.72$ milligal. The fact that most of the equations give
differences makes the normal equations far from orthogonal, as is shown
by the fact that the coefficient of $x_5$ drops from 3.0 in (24) to 1.46 in (38).

Seidel’s method (see p. 173) was tried on these equations, but conver- 
gence was too slow. This method is really adapted only to problems where 
the equations are nearly orthogonal, otherwise the estimation of uncer- 
tainty becomes more laborious than the solution of the normal equations. 

With a slight modification, however, the method succeeds. The
difficulty arises principally from $x_5$; the equations give direct
determinations of $x_0$, $x_1$, and $x_3$, while $x_4$ is connected directly to all
these three. But $x_5$ has only a single connexion with $x_1$ and two with
$x_4$. Hence $x_5$ really has little to say concerning the values of the other
four, which would be well determined without it. If we drop the
equations containing $x_5$ we have the normal equations

$3x_0 - x_4 = 0.0000$,                  (56)
$4x_1 - x_3 - x_4 = +0.0014$,           (57)
$-x_1 + 3x_3 - x_4 = +0.0023$,          (58)
$-x_0 - x_1 - x_3 + 3x_4 = -0.0030$.    (59)


These are nearly orthogonal. The largest term on the right is in (59);
we therefore take a first approximation $x_4 = -0.0010$. Then from (56),
$x_0 = -0.0003$; from (58), $x_3 = +0.0004$; and from (57), $x_1 = +0.0002$.
Substituting these approximations in the left sides we have in turn

$3x_0 - x_4 = +0.0001$,
$4x_1 - x_3 - x_4 = +0.0014$,
$-x_1 + 3x_3 - x_4 = +0.0020$,
$-x_0 - x_1 - x_3 + 3x_4 = -0.0033$.

Comparing with the original equations we see that (58) and (59) are
both $+0.0003$ higher, and that we can add 0.0001 to $x_3$ and $x_4$. Then

$x_0 = -0.0003$; $x_1 = +0.0002$; $x_3 = +0.0005$; $x_4 = -0.0009$.

This is very near the solution (40).

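If the equations were strictly orthogonal the standard errors would
be $\sigma/\sqrt{3}$, $\sigma/2$, $\sigma/\sqrt{3}$, $\sigma/\sqrt{3}$, and independent. To a second
approximation

$$\sigma^2(x_3) = \tfrac13\sigma^2 + \tfrac19\sigma^2(x_1) + \tfrac19\sigma^2(x_4),$$

$$\sigma^2(x_4) = \tfrac13\sigma^2 + \tfrac19\sigma^2(x_0) + \tfrac19\sigma^2(x_1) + \tfrac19\sigma^2(x_3).$$

By iteration we find, nearly,

$\sigma^2(x_0) = 0.40\sigma^2$; $\sigma^2(x_1) = 0.31\sigma^2$; $\sigma^2(x_3) = 0.42\sigma^2$; $\sigma^2(x_4) = 0.46\sigma^2$.

$$x_5 = \tfrac23(x_4 + 0.0009) + \tfrac13(x_1 - 0.0002) \pm \sigma/\sqrt{3} = 0.0000 \pm 0.56^{1/2}\sigma.$$

$\sigma$ is estimated as before, and the solution is

$x_0 = -0.0003 \pm 0.0010$,
$x_1 = +0.0002 \pm 0.0009$,
$x_3 = +0.0005 \pm 0.0010$,
$x_4 = -0.0009 \pm 0.0011$,
$x_5 = 0.0000 \pm 0.0012$.

The accuracy would be enough for all practical purposes.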
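For comparison, a plain Seidel iteration on the reduced equations (56) to (59) can be sketched as follows. This is not the hand procedure of the text, which starts from the largest right-hand side and adjusts by inspection; it is the textbook sweep in a fixed order, written here with invented names, with right sides in units of 0.0001 gal.

```python
# Reduced normal equations (56)-(59) for (x0, x1, x3, x4); right sides in 0.0001 gal.
A_REDUCED = [
    [3, 0, 0, -1],
    [0, 4, -1, -1],
    [0, -1, 3, -1],
    [-1, -1, -1, 3],
]
B_REDUCED = [0.0, 14.0, 23.0, -30.0]

def gauss_seidel(a, b, sweeps):
    """Solve each equation in turn for its diagonal unknown, reusing the newest values."""
    x = [0.0] * len(b)
    for _ in range(sweeps):
        for i in range(len(b)):
            others = sum(a[i][j] * x[j] for j in range(len(b)) if j != i)
            x[i] = (b[i] - others) / a[i][i]
    return x

X_FEW = gauss_seidel(A_REDUCED, B_REDUCED, 5)      # a handful of sweeps
X_REDUCED = gauss_seidel(A_REDUCED, B_REDUCED, 200)

# Largest remaining discrepancy in the equations after the long run.
MAX_DISCREPANCY = max(
    abs(sum(A_REDUCED[i][j] * X_REDUCED[j] for j in range(4)) - B_REDUCED[i])
    for i in range(4)
)
```

Because the off-diagonal coefficients are small beside the diagonal ones, a few sweeps already settle the fourth decimal of a gal, which is all that the hand approximations above carry.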





3.54. The following problem, and various extensions of it, have often
occurred in astronomy. There are cases where a group of stars can be
assumed all to have the same parallax; the estimates from any star
separately are comparable with their standard errors, but the mean of
all is substantially more than its standard error. The physical restriction
here is that a parallax cannot be negative. It is substantially less than
the standard error of one observation, and we may adopt a uniform
prior probability over positive values. If, then, $\alpha$ is the general parallax
and $a_r$, $s_r$ the separate estimates with their standard errors, the number
of observations in each case being large, we have

$$P(da_1 \ldots da_n \mid \alpha H) \propto \exp\Big\{-\sum_r \frac{(a_r-\alpha)^2}{2s_r^2}\Big\}\, da_1 \ldots da_n$$

and

$$P(d\alpha \mid H) \propto d\alpha \quad (\alpha > 0); \qquad = 0 \quad (\alpha < 0).$$

Then

$$P(d\alpha \mid a_1 \ldots a_n H) \propto \exp\Big\{-\sum_r \frac{(\alpha-a_r)^2}{2s_r^2}\Big\}\, d\alpha \quad (\alpha > 0); \qquad = 0 \quad (\alpha < 0).$$

The posterior probability of $\alpha$ is therefore a normal one about the
weighted mean of the $a_r$, but it is truncated at $\alpha = 0$.
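A truncated normal posterior of this kind is easy to evaluate. The sketch below is illustrative only: the five parallax estimates and their common standard error are invented, the function names are not from the text, and the posterior mean is included merely as one convenient summary of the truncated law.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def parallax_posterior(estimates, std_errors):
    """Weighted mean, its standard error, and the mean of the normal law truncated at 0."""
    weights = [1.0 / s ** 2 for s in std_errors]
    mean = sum(w * a for w, a in zip(weights, estimates)) / sum(weights)
    se = 1.0 / math.sqrt(sum(weights))
    z = mean / se
    truncated_mean = mean + se * normal_pdf(z) / normal_cdf(z)
    return mean, se, truncated_mean

def prob_exceeds(value, mean, se):
    """Posterior probability that the parallax exceeds a non-negative value."""
    return (1.0 - normal_cdf((value - mean) / se)) / normal_cdf(mean / se)

# Hypothetical estimates (seconds of arc); negative ones are kept, not rejected.
ESTIMATES = [0.012, -0.005, 0.020, -0.002, 0.008]
STD_ERRORS = [0.010] * 5

MEAN, SE, TRUNCATED_MEAN = parallax_posterior(ESTIMATES, STD_ERRORS)
P_POSITIVE = prob_exceeds(0.0, MEAN, SE)
```

All estimates, negative ones included, enter the weighted mean; the impossibility of a negative parallax enters once, through the truncation at the end.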

The treatment of such problems has given rise to some discussion. 
In the conditions of the problem some of the estimates are usually 
negative. These have sometimes been rejected as impossible, and a 
mean is taken of the positive ones. Then the rejection of a large frac- 
tion of the negative random errors biases the mean by an amount com- 
parable with the standard error of one determination. We are entitled 
to allow for the impossibility of a negative true parallax, but this can 
only be done at the end when we take the prior probability into account. 
If only one star was in question we should still be entitled to take it 
into account. We must not, however, do it by rejecting factors from the 
likelihood. The point is somewhat similar to one that arises in one case
of the combination of correlation coefficients (p. ), where there is a
constant term in $\zeta - z$ arising partly from the prior probability and partly
from the likelihood. But when several estimates are combined the part
from the prior probability only enters once, while that from the
likelihood enters every time. Similar considerations have occurred in the
estimation of the focal depths of shallow earthquakes. Here the depth
$h$ enters through $h^2$; and the least squares solution is liable to give
negative $h^2$. There are two valid treatments possible. One is to take
$h$ as zero in all cases in the estimation of other parameters, especially
the velocities, thus regarding the whole of the estimated values as not
significant. The other is to eliminate $h$ from all the solutions and
combine the equations for the velocities. What is not valid is to reject the
cases of negative estimated $h^2$ and determine the velocities from the
rest; this gives a bias in the estimated velocities.


3.6. The rectangular distribution. This distribution is of theoretical
interest on account of the fact that the mean of all the observed values
gives a less accurate estimate of the centre of the distribution than the
mean of the two extreme observations does by itself. Let the centre
of the distribution be $\alpha$ and the range $2\sigma$, to be determined. The chance
of an observation in a range $dx$ is

$$P(dx \mid \alpha,\sigma,H) = dx/2\sigma \quad (\alpha-\sigma < x < \alpha+\sigma), \tag{1}$$

$$P(dx \mid \alpha,\sigma,H) = 0 \quad (x < \alpha-\sigma,\ x > \alpha+\sigma). \tag{2}$$

The chance of $n$ observations in given ranges is

$$P(dx_1 \ldots dx_n \mid \alpha,\sigma,H) = \prod (dx_r)/(2\sigma)^n, \tag{3}$$

provided that all the $x_r$ satisfy the conditions

$$\alpha-\sigma < x_r < \alpha+\sigma, \tag{4}$$

and therefore provided that the extreme observations satisfy them.
Call these $x_1$ and $x_2$. We take $\alpha$ and $\sigma$ as initially unknown, and
therefore

$$P(d\alpha\, d\sigma \mid H) \propto d\alpha\, d\sigma/\sigma \tag{5}$$

and

$$P(d\alpha\, d\sigma \mid x_1 \ldots x_n H) \propto \sigma^{-n-1}\, d\alpha\, d\sigma, \tag{6}$$

provided now

$$\alpha-\sigma < x_1; \qquad \alpha+\sigma > x_2. \tag{7}$$

These conditions fix the possible joint range of $\alpha$ and $\sigma$, given the
observations, and apart from the restrictions on the range the
observations do not appear in (6). Hence, with the rectangular law, the two
extreme observations are sufficient statistics for $\alpha$ and $\sigma$.

Then

$$P(d\alpha \mid x_1 \ldots x_n H) \propto d\alpha \int \sigma^{-n-1}\, d\sigma \tag{8}$$

through the permitted range. But, given $\alpha$, $\sigma$ must be greater than the
larger of $\alpha - x_1$ and $x_2 - \alpha$; the lower limit for $\sigma$ is therefore $\alpha - x_1$ if
$\alpha > \tfrac12(x_1+x_2)$, and $x_2 - \alpha$ if $\alpha < \tfrac12(x_1+x_2)$. Hence

$$P(d\alpha \mid x_1 \ldots x_n H) \propto (\alpha - x_1)^{-n}\, d\alpha \quad \{\alpha > \tfrac12(x_1+x_2)\}, \tag{9}$$

$$P(d\alpha \mid x_1 \ldots x_n H) \propto (x_2 - \alpha)^{-n}\, d\alpha \quad \{\alpha < \tfrac12(x_1+x_2)\}, \tag{10}$$

with the same constant factor in both cases. The posterior probability
for $\alpha$, therefore, has a sharp peak at the mean of the extreme values.



The constant factor is easily found to be $2^{-n}(n-1)(x_2-x_1)^{n-1}$. If $n = 2$,
we have

$$P(x_1 < \alpha < x_2 \mid x_1, x_2, H) = 2^{-n}(n-1)(x_2-x_1)^{n-1}\Big[\int_{x_1}^{\frac12(x_1+x_2)} (x_2-\alpha)^{-n}\, d\alpha + \int_{\frac12(x_1+x_2)}^{x_2} (\alpha-x_1)^{-n}\, d\alpha\Big] = \tfrac12. \tag{11}$$

Thus, if we have only two observations, and $\alpha$ and $\sigma$ are originally
unknown, the posterior probability that $\alpha$ lies between the observed
values is $\tfrac12$. This is a general rule for any continuous law of error; we
have already had a case of it for the normal law.
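The normalizing factor and the value ½ can be confirmed by direct quadrature. The sketch below assumes the density (9), (10) with the constant factor quoted above; the extreme observations 1.0 and 3.0, the step count, and the function names are arbitrary choices for illustration.

```python
def alpha_density(alpha, x_min, x_max, n):
    """Posterior density of the centre, from (9) and (10) with the constant factor."""
    d = x_max - x_min
    const = 2.0 ** (-n) * (n - 1) * d ** (n - 1)
    mid = 0.5 * (x_min + x_max)
    dist = alpha - x_min if alpha > mid else x_max - alpha
    return const * dist ** (-n)

def prob_between(x_min, x_max, n, steps=20000):
    """Midpoint-rule integral of the density between the two extreme observations."""
    h = (x_max - x_min) / steps
    return h * sum(alpha_density(x_min + (k + 0.5) * h, x_min, x_max, n)
                   for k in range(steps))

P_TWO = prob_between(1.0, 3.0, 2)    # two observations
P_FIVE = prob_between(1.0, 3.0, 5)   # extremes of five observations
```

Carrying out the integral in (11) for general $n$ gives $1 - 2^{-(n-1)}$, so the probability that the centre lies between the extremes rises quickly with the number of observations.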

The possible values of $\alpha$, given $\sigma$, range from $x_2-\sigma$ to $x_1+\sigma$, provided
the latter is the greater. Then

$$P(d\sigma \mid x_1 \ldots x_n H) \propto \sigma^{-n-1}(2\sigma - x_2 + x_1)\, d\sigma \tag{12}$$

for $\sigma > \tfrac12(x_2-x_1)$. The constant factor is found to be

$$2^{-n}n(n-1)(x_2-x_1)^{n-1}.$$

If $n = 1$, the range for $\alpha$ is from $x_1-\sigma$ to $x_1+\sigma$, and (6) leads to

$$P(d\sigma \mid x_1, H) \propto d\sigma/\sigma, \tag{13}$$

which expresses the same fact as for the normal law, that one
observation can tell us nothing about its own accuracy. It may be noticed
that the probability density for $\sigma$ vanishes at $\sigma = \tfrac12(x_2-x_1)$ and has
a maximum at $\sigma = \tfrac12(1+1/n)(x_2-x_1)$. This is because the extreme
value would require both $x_1$ and $x_2$ to have fallen at the extremes of
the law, which would be surprising, but it would not be surprising that
both should fall a little within them.

On account of the form of the limiting conditions the posterior
probabilities of $\alpha$ and $\sigma$ are far from independent; any inference that
involves both should proceed from (6) directly. If we want the termini
$\alpha_1 = \alpha-\sigma$ and $\alpha_2 = \alpha+\sigma$, (6) transforms to

$$P(d\alpha_1\, d\alpha_2 \mid x_1 \ldots x_n H) \propto d\alpha_1\, d\alpha_2/(\alpha_2-\alpha_1)^{n+1} \quad (\alpha_1 < x_1,\ \alpha_2 > x_2); \tag{14}$$

whence for $\alpha_2 > x_2$,

$$P(d\alpha_2 \mid x_1 \ldots x_n H) = (n-1)(x_2-x_1)^{n-1}(\alpha_2-x_1)^{-n}\, d\alpha_2. \tag{15}$$

If we fix limits such that the probability that $\alpha$, $\alpha_1$, or $\alpha_2$ lies between
them has any definite value, the distance between these limits will
decrease like $1/n$ as the number of observations increases, whereas with
the normal law of error the corresponding distance decreases like $1/\sqrt{n}$.
This kind of result usually arises for laws of error with a finite range
where the gradient of the law is non-zero at an extreme, and especially
for any U-shaped or J-shaped law. The rectangular law is merely the
transition from a bell-shape to a U-shape.

The use of the mean and second moment as location and scale
parameters in such cases sacrifices much information. For with the
rectangular law the second moment of the law is $\tfrac13\sigma^2$ and the standard
error of the mean of $n$ observations, given $\sigma$, will be $\sigma/\sqrt{3n}$, thus
diminishing like $1/\sqrt{n}$, whereas any range for a definite probability that
$\alpha$ lies within it will diminish like $1/n$ if we use the most accurate methods
of fitting.
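The contrast between the two rates can be seen in a small sampling experiment. This is a sketch from the direct (sampling) side rather than a posterior calculation: it draws repeated samples from a rectangular law with centre 0 and half-range 1 and compares the scatter of the arithmetic mean with that of the mean of the two extremes. The sample size, number of trials, and seed are arbitrary.

```python
import random
import statistics

def compare_estimators(n, trials, seed=0):
    """Scatter of the sample mean and of the midrange for a rectangular law on (-1, 1)."""
    rng = random.Random(seed)
    means, midranges = [], []
    for _ in range(trials):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        means.append(sum(xs) / n)
        midranges.append(0.5 * (min(xs) + max(xs)))
    return statistics.pstdev(means), statistics.pstdev(midranges)

SD_MEAN, SD_MIDRANGE = compare_estimators(n=100, trials=2000)
EXPECTED_SD_MEAN = 1.0 / (3 * 100) ** 0.5   # sigma / sqrt(3n) with sigma = 1
```

With $n = 100$ the scatter of the mean should come out close to $\sigma/\sqrt{3n}$, while that of the midrange is several times smaller.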

3.61. Re-scaling of a law of chance. As many laws do not lead to
sufficient statistics, as the normal and rectangular laws do, it has
sometimes been suggested that it would be beneficial to choose a new
variable whose law will be normal or rectangular. Thus if the law is

$$P(dx \mid \alpha,\sigma,H) = f\Big(\frac{x-\alpha}{\sigma}\Big)\frac{dx}{\sigma},$$

we can define

$$\frac{y-\alpha}{\sigma} = \int_{-\infty}^{x} f\Big(\frac{x-\alpha}{\sigma}\Big)\frac{dx}{\sigma},$$

and then

$$P(dy \mid \alpha,\sigma,H) = dy/\sigma \quad (\alpha < y < \alpha+\sigma).$$

Similarly we could define a $z$ such that

$$\int_{-\infty}^{x} f\Big(\frac{x-\alpha}{\sigma}\Big)\frac{dx}{\sigma} = \frac{1}{\sqrt{2\pi}\,\sigma}\int_{-\infty}^{z} \exp\Big\{-\frac{(z-\alpha)^2}{2\sigma^2}\Big\}\, dz,$$

and the chance of $z$ is normally distributed about $\alpha$ with standard error $\sigma$.

It has been suggested that such transformations can be used to
simplify methods of estimation, but they are useless. In the first place,
for given $x$ we do not know the corresponding value of $y$ or $z$ until we
know $\alpha$ and $\sigma$; and the whole reason for an estimation problem is that
we do not. In the second, if $x$ can be transformed so that

$$\int_{-\infty}^{x} f(x)\, dx = \int_{-\infty}^{y} g(y)\, dy,$$

then

$$P(dy \mid \alpha,\sigma,H) = g(y)\, dy = f(x)\frac{dx}{dy}\, dy,$$

where $dx/dy$ will also depend on $\alpha$ and $\sigma$. If values of $x$ are observed,
the correct likelihood factor is $\prod f(x_r)$. If instead we use $y$ we shall
get a factor $\prod g(y_r)$. Thus the two likelihoods will differ by a factor
$\prod (dy/dx)_{x=x_r}$, a function depending on $\alpha$ and $\sigma$ for every observation.

It is remarkable that such maltreatment of the likelihood is
recommended (but so far as I know not used because it cannot be) by
statisticians who object to the prior probability, which only appears
once in any given problem.

3.62. Reading of a scale. The commonest case where errors do not
satisfy a normal law is the measurement of a length by means of a
scale, the positions of the ends being read to the nearest multiple of the
scale interval. Let the length of the object be $L$ units. Two cases arise.
In the first, we place one end of the object at a graduation, say the $m$th,
and read the position of the other to the nearest graduation. Then
clearly we shall always record the length as $k$ units, where $k$ is the
integer nearest to $L$. Hence, for any $k$,

$$P(k \mid LH) = 1 \quad (-\tfrac12 < L-k < \tfrac12), \qquad P(k \mid LH) = 0 \quad (|L-k| > \tfrac12).$$

If $n$ observations are made, and $P(dL \mid H) \propto dL$,

$$P(dL \mid \theta H) = dL \quad (-\tfrac12 < L-k < \tfrac12),$$

$$P(dL \mid \theta H) = 0 \quad (|L-k| > \tfrac12).$$

In this simple case increasing the number of measurements does nothing
to increase the accuracy of the determination. The posterior probability
distribution is rectangular.

In the second case, we may put one end at an arbitrary position on the
scale, say at $m+y$ units from one end, where $-\tfrac12 < y < \tfrac12$; if the length
is $L = k+x$ units, where $0 < x < 1$, the nearest graduation to the other
end will be the $(m+k)$th if $|x+y| < \tfrac12$, that is, if $-\tfrac12 < y < \tfrac12 - x$, and
will be the $(m+k+1)$th if $|x+y| > \tfrac12$, that is, if $\tfrac12 - x < y < \tfrac12$. But

$$P(dy \mid H) = dy \quad (|y| < \tfrac12), \qquad P(dy \mid H) = 0 \quad (|y| > \tfrac12),$$

and therefore

$$P(k \mid LH) = 1-x; \qquad P(k+1 \mid LH) = x.$$

If $r$ observations give the value $k$, and $s$ the value $k+1$, we have

$$P(\theta \mid LH) = (1-x)^r x^s;$$

$$P(dL \mid \theta H) \propto (1-x)^r x^s\, dL = (1-x)^r x^s\, dx.$$

The coefficient of $dx$ is a maximum if

$$x = x_0 = \frac{s}{r+s},$$

so that the most probable value is the mean of the observed values.
The standard error is nearly $\{rs/(r+s)^3\}^{1/2}$, which is not
independent of $x_0$.
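The second case lends itself to a quick simulation. The sketch below is an illustration with invented values: a true length of 7.3 scale units, 20000 placements, and a fixed seed. The first end is placed at a random offset $y$ from a graduation (taken as graduation 0), and the recorded length is the nearest graduation to the other end.

```python
import math
import random

def read_length(true_length, rng):
    """One measurement by difference: random offset y, other end read to the nearest mark."""
    y = rng.uniform(-0.5, 0.5)                   # first end lies at graduation 0 plus y
    return math.floor(true_length + y + 0.5)     # nearest graduation to the other end

RNG = random.Random(0)
TRUE_LENGTH = 7.3                                # k = 7, x = 0.3 (hypothetical)
READINGS = [read_length(TRUE_LENGTH, RNG) for _ in range(20000)]

R_COUNT = READINGS.count(7)                      # r readings of k
S_COUNT = READINGS.count(8)                      # s readings of k + 1
X_MOST_PROBABLE = S_COUNT / (R_COUNT + S_COUNT)  # maximum of (1 - x)^r x^s
MEAN_READING = sum(READINGS) / len(READINGS)
```

Only the two values $k$ and $k+1$ ever occur, and the fraction $s/(r+s)$ of readings equal to $k+1$ estimates the fractional part $x$, so that the mean reading recovers $L$.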



By a theorem due to Gauss (p. 189), if the probability of an error,
given the true value, is a function of the error alone, and if the likelihood
is a maximum when the true value is taken to be the mean of the
observed values, the law of error must be normal. In this problem the
second condition is true but the conclusion is false. The first condition
is false because $P(k \mid LH)$ is not a function of $k-L$ alone; if we vary $L$
but keep $k-L$ an integer, $k$ will take non-integral values, which are
forbidden by the conditions of the problem. Keynes has shown† that if
the law of error is

$$P(dx \mid \xi H) = f(x, \xi)\, dx,$$

where $f$ is twice differentiable with regard to $\xi$, a necessary and sufficient
condition for the maximum likelihood estimate to be always the
arithmetic mean of the observed values $x_r$ is

$$\log f(x,\xi) = (\xi - x)\phi'(\xi) - \phi(\xi) + \psi(x).$$

The normal law corresponds to

$$\phi(\xi) = -\xi^2/2\sigma^2; \qquad \psi(x) = -x^2/2\sigma^2 + \text{constant}.$$

The law for measurement by difference corresponds to

$$\phi(\xi) = (\xi-1)\log(1-\xi) - \xi\log\xi; \qquad \psi(x) = 0.$$

The Poisson law corresponds to

$$\phi(\xi) = -\xi\log\xi; \qquad \psi(x) = -x - \log x!.$$

These reductions to Keynes's form are due to M. S. Bartlett.‡
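The three reductions can be checked numerically. The sketch assumes that Keynes's condition is written as $\log f(x,\xi) = (\xi - x)\phi'(\xi) - \phi(\xi) + \psi(x)$, which is the sign convention that the three quoted pairs $\phi$, $\psi$ satisfy; that form, the numerical derivative, and all function names are assumptions of this illustration. For the normal law $\sigma$ is taken as 1, and for measurement by difference $x$ is the recorded excess over $k$ (0 or 1).

```python
import math

def keynes_log_f(x, xi, phi, psi, h=1e-6):
    """Evaluate (xi - x) * phi'(xi) - phi(xi) + psi(x), with phi' by central difference."""
    dphi = (phi(xi + h) - phi(xi - h)) / (2.0 * h)
    return (xi - x) * dphi - phi(xi) + psi(x)

# Normal law, sigma = 1: log f = -(x - xi)^2 / 2 - log(sqrt(2 pi)).
NORMAL = keynes_log_f(
    1.7, 0.4,
    phi=lambda xi: -xi * xi / 2.0,
    psi=lambda x: -x * x / 2.0 - 0.5 * math.log(2.0 * math.pi),
)
NORMAL_DIRECT = -(1.7 - 0.4) ** 2 / 2.0 - 0.5 * math.log(2.0 * math.pi)

# Measurement by difference: P(0) = 1 - xi, P(1) = xi.
def phi_scale(xi):
    return (xi - 1.0) * math.log(1.0 - xi) - xi * math.log(xi)

SCALE_0 = keynes_log_f(0, 0.3, phi_scale, lambda x: 0.0)
SCALE_1 = keynes_log_f(1, 0.3, phi_scale, lambda x: 0.0)

# Poisson law: log f = x log xi - xi - log x!.
POISSON = keynes_log_f(
    3, 2.5,
    phi=lambda xi: -xi * math.log(xi),
    psi=lambda x: -x - math.lgamma(x + 1),
)
POISSON_DIRECT = 3 * math.log(2.5) - 2.5 - math.log(6.0)
```

In each case the value built from $\phi$ and $\psi$ should agree with the logarithm of the law written down directly.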

This problem is of some theoretical interest. In practice the peculiar
behaviour of the posterior probability would lead to difficulties in
calculation. These are reduced if we can reduce the step of the scale,
for instance by means of a microscope, so that the error of reading is no
longer the principal source of error.

3.7. The posterior probabilities that the true value, or the third
observation, will lie between the first two observations. Let us
suppose that a law of error is given by $hf\{h(x-\alpha)\}\,dx$, where $f$ may
have any form and $h$ plays the part of the precision constant, or the
reciprocal of the scale parameter. Put

$$\int_{-\infty}^{z} f(z)\, dz = F(z), \qquad F(\infty) = 1. \tag{1}$$

† Treatise on Probability, p. 197.
‡ Proc. Roy. Soc. A, 141, 1933, 624-6.

If $\alpha$ and $h$ are originally unknown, we have

$$P(d\alpha\, dh \mid H) \propto d\alpha\, dh/h, \tag{2}$$

$$P(dx_1\, dx_2 \mid \alpha, h, H) = h^2 f\{h(x_1-\alpha)\} f\{h(x_2-\alpha)\}\, dx_1\, dx_2, \tag{3}$$

and

$$P(d\alpha\, dh \mid x_1, x_2, H) \propto h f\{h(x_1-\alpha)\} f\{h(x_2-\alpha)\}\, d\alpha\, dh. \tag{4}$$

The probability, given $x_1$ and $x_2$ ($x_2 > x_1$), that the third observation
will lie in any range $dx_3$, is

$$P(dx_3 \mid x_1, x_2, H) = \iint P(dx_3\, d\alpha\, dh \mid x_1, x_2, H), \tag{5}$$

integrated over all possible values of $\alpha$ and $h$,

$$\propto dx_3 \iint h^2 f\{h(x_1-\alpha)\} f\{h(x_2-\alpha)\} f\{h(x_3-\alpha)\}\, d\alpha\, dh. \tag{6}$$

Let us transform the variables to

$$\theta = h(x_1-\alpha), \qquad \phi = h(x_2-\alpha). \tag{7}$$

The probability, given $x_1$ and $x_2$, that $\alpha$ is between them is $I_1/I_2$, where
$I_1$ and $I_2$ are got by integrating (4) from 0 to $\infty$ with regard to $h$, and
respectively from $x_1$ to $x_2$ and from $-\infty$ to $\infty$ with regard to $\alpha$. Then

$$(x_2-x_1)I_1 \propto \int_{\theta=-\infty}^{0}\int_{\phi=0}^{\infty} f(\theta)f(\phi)\, d\theta\, d\phi = F(0)\{1-F(0)\}, \tag{8}$$

$$(x_2-x_1)I_2 \propto \int_{\theta=-\infty}^{\infty}\int_{\phi=\theta}^{\infty} f(\theta)f(\phi)\, d\theta\, d\phi = \int_{-\infty}^{\infty} f(\theta)\{1-F(\theta)\}\, d\theta = 1-\tfrac12 = \tfrac12. \tag{9}$$

Hence

$$I_1/I_2 = 2F(0)\{1-F(0)\}. \tag{10}$$

If then $F(0) = \tfrac12$ the ratio is $\tfrac12$. In all other cases it is less than $\tfrac12$.
Referring to (1) we see that $F(0) = \tfrac12$ is the statement that for any
given values of $\alpha$ and $h$ an observation is as likely to exceed $\alpha$ as to
fall short of it. There will be such a value for any continuous law of
given form. Hence, if we define the true value to mean the median
of the law, then the probability, given the first two observations, that
the true value lies between them is $\tfrac12$, whatever their separation. If we
chose any other location parameter than the median of the law, and
the law was unsymmetrical, the ratio would be less than $\tfrac12$. This is a
definite reason for choosing the median as the location parameter in
any case where the form of the law is unknown. We have already
obtained the result in the special cases of the normal and rectangular
laws.



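The probability that $x_3$ will lie between $x_1$ and $x_2$, given the latter,
is $I_3/I_4$, where

$$I_3 = \int_{-\infty}^{\infty}\int_{0}^{\infty}\int_{x_1}^{x_2} h^2 f\{h(x_1-\alpha)\} f\{h(x_2-\alpha)\} f\{h(x_3-\alpha)\}\, d\alpha\, dh\, dx_3 \tag{11}$$

$$= \int_{-\infty}^{\infty}\int_{0}^{\infty} h f\{h(x_1-\alpha)\} f\{h(x_2-\alpha)\}\big[F\{h(x_2-\alpha)\} - F\{h(x_1-\alpha)\}\big]\, d\alpha\, dh$$

$$= \frac{1}{x_2-x_1}\int_{\theta=-\infty}^{\infty}\int_{\phi=\theta}^{\infty} f(\theta)f(\phi)\{F(\phi)-F(\theta)\}\, d\theta\, d\phi$$

$$= \frac{1}{x_2-x_1}\int_{-\infty}^{\infty} \big[\tfrac12 f(\theta)\{1-F^2(\theta)\} - f(\theta)F(\theta)\{1-F(\theta)\}\big]\, d\theta$$

$$= \frac{1}{x_2-x_1}\big(\tfrac12 - \tfrac16 - \tfrac12 + \tfrac13\big) = \frac{1}{6(x_2-x_1)}, \tag{12}$$

$$I_4 = \int_{-\infty}^{\infty}\int_{0}^{\infty}\int_{-\infty}^{\infty} h^2 f\{h(x_1-\alpha)\} f\{h(x_2-\alpha)\} f\{h(x_3-\alpha)\}\, d\alpha\, dh\, dx_3 \tag{13}$$

$$= \int_{-\infty}^{\infty}\int_{0}^{\infty} h f\{h(x_1-\alpha)\} f\{h(x_2-\alpha)\}\, d\alpha\, dh$$

$$= \frac{1}{x_2-x_1}\int_{\theta=-\infty}^{\infty}\int_{\phi=\theta}^{\infty} f(\theta)f(\phi)\, d\theta\, d\phi$$

$$= \frac{1}{x_2-x_1}\int_{-\infty}^{\infty} f(\theta)\{1-F(\theta)\}\, d\theta$$

$$= \frac{1}{2(x_2-x_1)}, \tag{14}$$

whence

$$I_3/I_4 = \tfrac13. \tag{15}$$

Thus, if the location and scale parameters are initially unknown, the
probability that the third observation will lie between the first two,
given the first two, is $\tfrac13$ whatever the separation of the first two.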
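The single integrals to which $I_1$, $I_2$, $I_3$, $I_4$ reduce can be evaluated numerically for any chosen $f$. The sketch below is an illustration with assumed names and an assumed integration range: it uses the normal form of $f$, first with the location parameter at the median, and then with the location parameter displaced by half a unit from the median so that $F(0) \neq \tfrac12$. The common factor $1/(x_2-x_1)$ cancels from both ratios and is omitted.

```python
import math

def integrate(g, lo=-10.0, hi=10.0, steps=20000):
    """Midpoint rule; the range is wide enough for the laws used here."""
    h = (hi - lo) / steps
    return h * sum(g(lo + (k + 0.5) * h) for k in range(steps))

def ratios(f, big_f):
    """Return (I1/I2, I3/I4) from the reduced forms of (8), (9), (12), (14)."""
    i1 = big_f(0.0) * (1.0 - big_f(0.0))
    i2 = integrate(lambda t: f(t) * (1.0 - big_f(t)))
    i3 = integrate(lambda t: 0.5 * f(t) * (1.0 - big_f(t) ** 2)
                   - f(t) * big_f(t) * (1.0 - big_f(t)))
    i4 = i2
    return i1 / i2, i3 / i4

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Location parameter at the median.
TRUE_SYM, THIRD_SYM = ratios(normal_pdf, normal_cdf)

# Location parameter half a unit below the median (so F(0) is not 1/2).
TRUE_SHIFT, THIRD_SHIFT = ratios(lambda z: normal_pdf(z - 0.5),
                                 lambda z: normal_cdf(z - 0.5))
```

The first ratio drops below ½ as soon as the location parameter is not the median, while the second stays at ⅓ in both cases, since it does not involve $F(0)$.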

The converse theorem to (10) would be that if the posterior
probability that the median of the law lies between the first two observations
is $\tfrac12$ whatever their separation, then the prior probability for $h$ must
be $dh/h$. If it was $X(bh)\,dh/h$, where $b$ is a quantity of the dimensions
of $\alpha$ or of $1/h$, the ratio $I_1/I_2$ would involve $b/(x_2-x_1)$ and could not be
the same for all values of $x_2-x_1$. The only possible modification would
therefore be

$$P(d\alpha\, dh \mid H) \propto h^{\gamma-1}\, d\alpha\, dh. \tag{16}$$

The question is whether $\gamma$ is necessarily 0 for all admissible forms of
$f(z)$. If we put

$$h\{\tfrac12(x_1+x_2)-\alpha\} = t, \tag{17}$$

$$\tfrac12 h(x_2-x_1) = s, \tag{18}$$

we find

$$h f\{h(x_1-\alpha)\} f\{h(x_2-\alpha)\}\, d\alpha = -f(t-s)f(t+s)\, dt, \tag{19}$$

and (16) in place of (2) will lead to

$$I_1 - \tfrac12 I_2 \propto \int_0^\infty s^\gamma\, ds \Big(2\int_{-s}^{s} - \int_{-\infty}^{\infty}\Big) f(t-s)f(t+s)\, dt. \tag{20}$$

Put

$$\Big(2\int_{-s}^{s} - \int_{-\infty}^{\infty}\Big) f(t-s)f(t+s)\, dt = G(s). \tag{21}$$

Then our postulate reduces to

$$\int_0^\infty s^\gamma G(s)\, ds = 0, \tag{22}$$

and we know from (10) that this is satisfied for all $x_1$, $x_2$ if $\gamma = 0$.
A sufficient condition for the absence of any other solution would be
that $G(s)$ shall change sign for precisely one value of $s$; for if this value
is $s_0$, we shall have

$$\int_0^\infty s_0^\gamma G(s)\, ds = 0, \tag{23}$$

and for positive $\gamma$ the integrand in (22) is numerically larger than in
(23) when $s > s_0$ and smaller when $s < s_0$. Hence (22) cannot hold for
any positive $\gamma$, and similarly for any negative $\gamma$. It has not been proved
that $G(s)$ has this property in general, but it has been verified for the
cases where $f(z) \propto \exp(-\tfrac12 z^2)$; $f(z) = \tfrac12\exp\{-|z|\}$; $f(z) = \tfrac12$ for
$-1 < z < 1$, and otherwise $= 0$; and for a remarkable case suggested
to me by Dr. A. C. Offord, where

$$f(z) = 1/2z^2 \quad (|z| > 1), \qquad f(z) = 0 \quad (|z| < 1).$$


The property has an interesting analogue in the direct problem.
Starting from (3) and putting $x_1+x_2 = 2a$, $x_2-x_1 = 2b$, we find

$$P(da \mid b\,\alpha\, h\, H) = \frac{h^2 f\{h(a-b-\alpha)\} f\{h(a+b-\alpha)\}\, da}{\int h^2 f\{h(a-b-\alpha)\} f\{h(a+b-\alpha)\}\, da}. \tag{24}$$

The condition that $x_1-\alpha$ and $x_2-\alpha$ shall have opposite signs is that
$|a-\alpha| < b$. Hence for any $b$ we can find the difference between the
chances that two observations with separation $2b$ will have opposite
signs or the same sign, and it is a positive multiple of

$$\Big(2\int_{-b}^{b} - \int_{-\infty}^{\infty}\Big) h f\{h(a-b-\alpha)\} f\{h(a+b-\alpha)\}\, d(a-\alpha) = G(hb). \tag{25}$$

The fact that the integral of $G(hb)$ over all values of $b$ is zero means
simply that the probabilities, given the law, that the first two
observations will be on the same or opposite sides of the median are equal.
For large $b$ there will be an excess chance that they will be on opposite
sides, for small $b$ on the same side, and for continuous $f(z)$ there will
be a $b$ such that the chances are equal. The result required is that
there is only one such $b$; this appears highly plausible but, as stated,
has not been definitely proved except for special, though extremely
different, forms of $f(z)$.
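The behaviour described for small and large separations can be watched in a sampling experiment. The sketch below is only an illustration: it takes the law $f(z) = \tfrac12\exp\{-|z|\}$ with median 0, draws pairs of observations, and counts how often the two fall on opposite sides of the median, first for all pairs and then for pairs whose separation is small or large. The cut-off separations 0.5 and 2.0, the number of pairs, and the seed are arbitrary.

```python
import random

def laplace(rng):
    """One draw from f(z) = exp(-|z|) / 2: an exponential magnitude with a random sign."""
    magnitude = rng.expovariate(1.0)
    return magnitude if rng.random() < 0.5 else -magnitude

def opposite_fraction(pairs, keep):
    """Among pairs whose separation passes `keep`, the fraction straddling the median."""
    chosen = [(a, b) for a, b in pairs if keep(abs(a - b))]
    return sum(1 for a, b in chosen if a * b < 0) / len(chosen)

RNG = random.Random(1)
PAIRS = [(laplace(RNG), laplace(RNG)) for _ in range(40000)]

FRACTION_ALL = opposite_fraction(PAIRS, lambda sep: True)
FRACTION_CLOSE = opposite_fraction(PAIRS, lambda sep: sep < 0.5)
FRACTION_WIDE = opposite_fraction(PAIRS, lambda sep: sep > 2.0)
```

Over all pairs the two cases are equally frequent; close pairs mostly lie on the same side of the median and widely separated pairs mostly on opposite sides, which is the change of sign of $G$ in sampling terms.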

In a former presentation of the problem I took as a postulate that if
$x_1$ and $x_2$ are the first two observations, $x_2$ being the larger, and if $x$
and $\sigma$ are initially unknown, then

$$P(x_1 < x_3 < x_2 \mid x_1, x_2, H) = \tfrac13. \tag{26}$$

I showed that only the rule

$$P(dx\, d\sigma \mid H) \propto dx\, d\sigma/\sigma \tag{27}$$

can lead to this. The former, however, was really derived from

$$P(x_1 < x_3 < x_2 \mid x, \sigma, H) = \tfrac13 \tag{28}$$

by an unconscious use of an argument analogous to that of 7.5. It is
reasonable to say that the probability in (26) must be a constant
independent of $x_1$ and $x_2$, but with a different power of $\sigma$ in (27) it would
still be a constant but not $\tfrac13$, and without some other principle it cannot
be used to show that the $d\sigma/\sigma$ rule is the only suitable one. The
arguments for this are those of 3.1. But the principle (27) can be considered
established otherwise, for complete previous ignorance of $x$ and $\sigma$, and
then we may ask whether we should expect it to be seriously altered
if there is any vague information about $\sigma$ such as we considered
on p. 105. If the proper procedure is simply to truncate the prior
probability law, and $x_2-x_1$ is much larger than the lower limit for $\sigma$
and much smaller than the upper, the effect on the posterior
probabilities will be negligible. This is in accordance with common sense.
But if we used $d\sigma/\sigma^{1+\gamma}$ we should be led to the result $\tfrac12$ for $\gamma = 1$ and
0 for $\gamma = -1$. The latter is the uniform distribution for $\sigma$, and would
lead also to the $dx/|x|$ rule for the posterior probability of $x$ from two
observations. Either would make a change in the probability
distribution for $x_3$ that cannot be accepted. We cannot admit that vague
information about the range of possible values can make appreciable
changes when the difference of the first two observations does not lie
near either extreme, and we avoid this by simply truncating the law;
and then we find that this makes a negligible difference to the result.
The conclusion then is that vague information may as well be neglected
and treated as total ignorance.


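3.8. Correlation. Let the joint chance of two variables $x$ and $y$ be
distributed according to the law

$$P(dx\, dy \mid \sigma, \tau, \rho, H) = \frac{1}{2\pi\sigma\tau(1-\rho^2)^{1/2}} \exp\Big\{-\frac{1}{2(1-\rho^2)}\Big(\frac{x^2}{\sigma^2} + \frac{y^2}{\tau^2} - \frac{2\rho xy}{\sigma\tau}\Big)\Big\}\, dx\, dy. \tag{1}$$

Then the joint chance of $n$ pairs $(x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)$ is

$$P(\theta \mid \sigma, \tau, \rho, H) = \frac{1}{(2\pi\sigma\tau)^n(1-\rho^2)^{n/2}} \exp\Big\{-\frac{1}{2(1-\rho^2)}\Big(\frac{Sx^2}{\sigma^2} + \frac{Sy^2}{\tau^2} - \frac{2\rho\, Sxy}{\sigma\tau}\Big)\Big\} \prod dx\, dy. \tag{2}$$

Put $Sx^2 = ns^2$, $Sy^2 = nt^2$, $Sxy = nrst$. Then $s$, $t$, and $r$ are sufficient
statistics for $\sigma$, $\tau$, and $\rho$.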

We take $\sigma$ and $\tau$ as initially unknown. In accordance with what
appears to be the natural interpretation of the correlation coefficient,
$\tfrac12(1+\rho)$ may be regarded as a sampling ratio, being the ratio of the
number of components that contribute to $x$ and $y$ with the same sign
to the whole number of components. Thus the prior probability of $\rho$,
in the most elementary case, can be taken as uniformly distributed, and

$$P(d\sigma\,d\tau\,d\rho \mid H) \propto d\sigma\,d\tau\,d\rho/\sigma\tau. \tag{3}$$


If $\rho$ is near $+1$ or $-1$ we may expect the rule to fail, for reasons
similar to those given for sampling. But then it will usually happen
also that one component contributes most of the variation, and the
validity of the normal correlation surface itself will fail. The best
treatment will then be to use the method of least squares. But in the
typical case where the methods of correlation would be used we may
adopt (3). Then, combining (2) with (3), we have

$$P(d\sigma\,d\tau\,d\rho \mid \theta H) \propto \frac{1}{(\sigma\tau)^n(1-\rho^2)^{n/2}} \exp\left\{-\frac{n}{2(1-\rho^2)}\left(\frac{s^2}{\sigma^2}+\frac{t^2}{\tau^2}-\frac{2\rho rst}{\sigma\tau}\right)\right\} \frac{d\sigma\,d\tau\,d\rho}{\sigma\tau}. \tag{4}$$





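The posterior probability distribution for $\rho$ can be obtained by the
substitution, due to Fisher,

$$\frac{st}{\sigma\tau} = \alpha, \qquad \frac{s\tau}{t\sigma} = e^{\beta}, \tag{5}$$

whence

$$\frac{s^2}{\sigma^2} = \alpha e^{\beta}, \qquad \frac{t^2}{\tau^2} = \alpha e^{-\beta}, \qquad \frac{\partial(\sigma,\tau)}{\partial(\alpha,\beta)} = \frac{\sigma\tau}{2\alpha}, \tag{6}$$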


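$$P(d\rho \mid \theta H) \propto \frac{d\rho}{(1-\rho^2)^{n/2}} \int_0^\infty\!\!\int_{-\infty}^{\infty} \alpha^{n-1} \exp\left\{-\frac{n\alpha}{1-\rho^2}(\cosh\beta-\rho r)\right\} d\alpha\,d\beta$$

$$\propto (1-\rho^2)^{n/2}\,d\rho \int_0^\infty \frac{d\beta}{(\cosh\beta-\rho r)^n}, \tag{7}$$

since the integrand is an even function of $\beta$. At this stage the only
function of the observations that is involved is $r$, so that $r$ is a sufficient
statistic for $\rho$. If we now put

$$\cosh\beta - \rho r = \frac{1-\rho r}{1-u}, \tag{8}$$

the integral is transformed into

$$\frac{(1-\rho^2)^{n/2}}{(1-\rho r)^{n-1/2}} \int_0^1 \frac{(1-u)^{n-1}}{\sqrt{2u}} \left\{1-\tfrac12(1+r\rho)u\right\}^{-1/2} du. \tag{9}$$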


Since $r$ and $\rho$ are at most equal to 1, we can expand the last factor in
powers of $u$, and integrate term by term, the coefficients being beta
functions. Then, apart from an irrelevant factor, we find

$$P(d\rho \mid \theta H) \propto \frac{(1-\rho^2)^{n/2}}{(1-\rho r)^{n-1/2}}\, S_n(\rho r)\,d\rho, \tag{10}$$

where

$$S_n(\rho r) = 1 + \frac{1}{n+\frac12}\,\frac{1+r\rho}{8} + \frac{9}{2!\,(n+\frac12)(n+\frac32)}\left(\frac{1+r\rho}{8}\right)^2 + \ldots, \tag{11}$$

a hypergeometric series. In actual cases $n$ is usually large, and there
is no appreciable error in reducing the series to its first term. But the
form (10) is very asymmetrical. We see that the density is greatest
near $\rho = r$, but since $\rho$ must be between $\pm 1$ there must be great
asymmetry if $r$ is not zero. This asymmetry can be greatly reduced by
a transformation, also due to Fisher,

$$\tanh\zeta = \rho; \qquad \tanh z = r; \qquad \zeta = z + x, \tag{12}$$





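so that the possible values of $\zeta$ and $z$ range between $\pm\infty$. This gives

$$P(d\zeta \mid \theta H) \propto \frac{d\zeta}{\cosh^{n+2}\zeta\,\cosh^{n-1/2}z\,(1-\tanh z\tanh\zeta)^{n-1/2}} \propto \frac{d\zeta}{\cosh^{5/2}\zeta\,\cosh^{-5/2}z\,\cosh^{n-1/2}x}, \tag{13}$$

a power of $\cosh z$ having been introduced to make the ordinate 1 at
$x = 0$. The ordinate is a maximum where

$$\frac{d}{dx}\left[\tfrac52\log\cosh\zeta + (n-\tfrac12)\log\cosh x\right] = 0, \tag{14}$$

or

$$\tfrac52\tanh\zeta + (n-\tfrac12)\tanh x = 0. \tag{15}$$

When $n$ is large, $x$ is small, and we have, nearly,

$$x = -\frac{5r}{2n}. \tag{16}$$

The second derivative is

$$-\tfrac52\,\mathrm{sech}^2\zeta - (n-\tfrac12)\,\mathrm{sech}^2 x = -n \text{ nearly}. \tag{17}$$

$\mathrm{sech}^2\zeta$ can range from 0 to 1, so that the second derivative can range
from $-(n-\tfrac12)$ to $-(n+2)$. Hence for large $n$ we can write

$$\zeta = z - \frac{5r}{2n} \pm \frac{1}{\sqrt n}. \tag{18}$$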


The distribution (13) is nearly symmetrical because the factor raised to a
high power is $\mathrm{sech}\,x$, and it can be treated as nearly normal. Returning
now to the series $S_n(\rho r)$, we see that its derivative with regard to $\rho$ is of
order $1/n$, and would displace the maximum ordinate by a quantity
of order $1/n^2$ if it was allowed for. But since the uncertainty is in any
case about $1/\sqrt n$ it is hardly worth while to allow for terms of order $1/n$,
and those of order $1/n^2$ can safely be omitted.
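
The approximation (18) can be checked numerically. The following is only a sketch under stated assumptions: it keeps the first term of the series in (10), so that the density of $\rho$ is taken as $(1-\rho^2)^{n/2}/(1-\rho r)^{n-1/2}$, and it locates the maximum of the corresponding density of $\zeta$ on a grid. The function names and the grid width are illustrative choices, not part of the text.

```python
import math

def log_posterior_rho(rho, r, n):
    """Log of (1 - rho^2)^(n/2) / (1 - rho*r)^(n - 1/2): first term of (10)."""
    return 0.5 * n * math.log(1.0 - rho * rho) - (n - 0.5) * math.log(1.0 - rho * r)

def zeta_mode_numeric(r, n, steps=20001):
    """Grid search for the maximum of the posterior density of zeta = atanh(rho)."""
    z = math.atanh(r)
    best_zeta, best_val = z, -math.inf
    for i in range(steps):
        zeta = z - 1.0 + 2.0 * i / (steps - 1)  # search within z +/- 1
        rho = math.tanh(zeta)
        # density in zeta = density in rho times d(rho)/d(zeta) = sech^2(zeta)
        val = log_posterior_rho(rho, r, n) - 2.0 * math.log(math.cosh(zeta))
        if val > best_val:
            best_zeta, best_val = zeta, val
    return best_zeta

def zeta_mode_approx(r, n):
    """Approximation (18): centre z - 5r/(2n) and standard error 1/sqrt(n)."""
    return math.atanh(r) - 5.0 * r / (2.0 * n), 1.0 / math.sqrt(n)

if __name__ == "__main__":
    r, n = 0.6, 50
    centre, se = zeta_mode_approx(r, n)
    print("numerical mode of zeta:", zeta_mode_numeric(r, n))
    print("approximation (18):    ", centre, "+/-", se)
```

For moderate $n$ the two centres should differ by much less than the standard error $1/\sqrt n$, which is the sense in which terms of order $1/n^2$ can be dropped.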

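In most cases where the correlation coefficient arises, the distribution
of chance is not centred on $(0, 0)$ but on a pair of values $(a, b)$, which
also have to be found from the observations. Then we must take

$$P(da\,db\,d\sigma\,d\tau\,d\rho \mid H) \propto da\,db\,d\sigma\,d\tau\,d\rho/\sigma\tau \tag{19}$$

and replace $x$ and $y$ in (1) by $x-a$ and $y-b$. Then

$$\frac{S(x-a)^2}{\sigma^2} + \frac{S(y-b)^2}{\tau^2} - \frac{2\rho\,S(x-a)(y-b)}{\sigma\tau} = n\left\{\frac{(a-\bar x)^2}{\sigma^2} + \frac{(b-\bar y)^2}{\tau^2} - \frac{2\rho(a-\bar x)(b-\bar y)}{\sigma\tau}\right\} + n\left\{\frac{s^2}{\sigma^2} + \frac{t^2}{\tau^2} - \frac{2\rho rst}{\sigma\tau}\right\}, \tag{20}$$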


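where now

$$n\bar x = Sx, \quad n\bar y = Sy, \quad ns^2 = S(x-\bar x)^2, \quad nt^2 = S(y-\bar y)^2, \quad nrst = S(x-\bar x)(y-\bar y). \tag{21}$$

Then

$$P(da\,db\,d\sigma\,d\tau\,d\rho \mid \theta H) \propto \frac{1}{(\sigma\tau)^{n+1}(1-\rho^2)^{n/2}} \exp\left[-\frac{n}{2(1-\rho^2)}\left\{\frac{(a-\bar x)^2}{\sigma^2} + \frac{(b-\bar y)^2}{\tau^2} - \frac{2\rho(a-\bar x)(b-\bar y)}{\sigma\tau} + \frac{s^2}{\sigma^2} + \frac{t^2}{\tau^2} - \frac{2\rho rst}{\sigma\tau}\right\}\right] da\,db\,d\sigma\,d\tau\,d\rho. \tag{22}$$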


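Integration with regard to $a$ and $b$ then gives

$$P(d\sigma\,d\tau\,d\rho \mid \theta H) \propto \frac{1}{(\sigma\tau)^n(1-\rho^2)^{(n-1)/2}} \exp\left\{-\frac{n}{2(1-\rho^2)}\left(\frac{s^2}{\sigma^2} + \frac{t^2}{\tau^2} - \frac{2\rho rst}{\sigma\tau}\right)\right\} d\sigma\,d\tau\,d\rho. \tag{23}$$

Applying the transformations (5) and integrating with regard to $\alpha$ and $\beta$
will therefore only give an irrelevant function of $n$ as a factor and
replace $n$ in (10) by $n-1$. Hence, in this case,

$$P(d\rho \mid \theta H) \propto \frac{(1-\rho^2)^{(n-1)/2}}{(1-\rho r)^{n-3/2}}\, S_{n-1}(\rho r)\,d\rho \tag{24}$$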


and, to the order retained, $\zeta$ will still be given by (18). A slight change
may perhaps be made with advantage in both cases. In the former,
if $n = 1$, $r$ will necessarily be $\pm 1$ whatever $\rho$ may be; in the latter
this will hold for $n = 2$. A permissible change will express this
indeterminacy by making the uncertainty of $\zeta$ infinite in these cases.
Thus in the former we can write

$$\zeta = z - \frac{5r}{2n} \pm \frac{1}{\sqrt{n-1}}, \tag{25}$$

and in the latter

$$\zeta = z - \frac{5r}{2n} \pm \frac{1}{\sqrt{n-2}}. \tag{26}$$


Fisher's theory of the correlation coefficient† follows different lines,
but has suggested several points in the above analysis. He obtains the
result that I should write

$$P(dr \mid a,b,\sigma,\tau,\rho,H) \propto \frac{(1-\rho^2)^{(n-1)/2}(1-r^2)^{(n-4)/2}}{(1-\rho r)^{n-3/2}}\, S_{n-1}(\rho r)\,dr, \tag{27}$$

and as this is independent of $a$, $b$, $\sigma$, and $\tau$ we can drop these and replace
the left side by $P(dr \mid \rho H)$. Also if we take the prior probability of $\rho$
as uniformly distributed, since $r$ and $dr$ are fixed for a given sample,
this leads to

$$P(d\rho \mid rH) \propto \frac{(1-\rho^2)^{(n-1)/2}}{(1-\rho r)^{n-3/2}}\, S_{n-1}(\rho r)\,d\rho, \tag{28}$$

which is identical with (24) except that the complete data $\theta$ are replaced
by $r$. This amounts to an alternative proof that $r$ is a sufficient statistic
for $\rho$; the data contain no information relevant to $\rho$ that is not contained
in $r$.†

† Biometrika, 10, 1915, 507-21; Metron, 1, 1921, Part 4, 3-32.

The bias shown by the second term in (25) and (26) is usually negligible,
but requires attention if several equally correlated series are likely to
be combined to give an improved estimate, since it always has the
same sign. The question here will be, how far can the series be supposed
mutually relevant? We cannot combine data from series with different
correlation coefficients. But if the correlation is the same in all series
we still have three cases.

1. $a, b, \sigma, \tau$ the same in all series. Here the best method is to combine
the data for all the series and find a summary value for $r$ from them.
The second term in $\zeta$ will now be $-5r/2\sum n$, which will be utterly
negligible.

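2. $a, b$ different in the series, $\sigma, \tau$ the same. Each pair $(a, b)$ must
now be eliminated separately and we shall be left with

$$P(d\sigma\,d\tau\,d\rho \mid \theta H) \propto \frac{1}{(\sigma\tau)^{\sum(n-1)}(1-\rho^2)^{\frac12\sum(n-1)}} \exp\left\{-\frac{1}{2(1-\rho^2)}\left(\frac{\sum ns^2}{\sigma^2} + \frac{\sum nt^2}{\tau^2} - \frac{2\rho\sum nrst}{\sigma\tau}\right)\right\} \frac{d\sigma\,d\tau\,d\rho}{\sigma\tau}. \tag{29}$$

The data, therefore, yield a summary correlation coefficient

$$R = \frac{\sum nrst}{(\sum ns^2)^{1/2}(\sum nt^2)^{1/2}} \tag{30}$$

and we proceed as before; the second term will be $-5R/2\sum(n-1)$.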

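3. $a, b, \sigma, \tau$ all different. Here $\sigma$ and $\tau$ must be eliminated for each
series separately, before we can proceed to $\rho$. In this case we shall be
led to the forms

$$P(d\rho \mid \theta H) \propto \prod \frac{(1-\rho^2)^{(n-1)/2}}{(1-\rho r)^{n-3/2}}\,d\rho, \tag{31}$$

$$P(d\zeta \mid \theta H) \propto \frac{d\zeta}{\cosh^{\frac12 p+2}\zeta\,\prod\cosh^{n-3/2}(\zeta-z)}, \tag{32}$$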


† Proc. Roy. Soc. A, 167, 1938, 464-76.





where $p$ is the number of series. The solution will therefore be,
approximately,

$$\zeta\sum(n-\tfrac32) = \sum(n-\tfrac32)z - (\tfrac12 p+2)\tanh\zeta \tag{33}$$

or, if we take $Z$ as the weighted mean of the values of $z$ and $\tanh Z = R$,

i^Z- i?± . (34) 

The accuracy is similar to that of (18). The bias shown by the second 
term will in this case persist, and must be taken into account if many 
series are combined, since it will remain of the same order of magnitude 
while the standard error diminishes. This point is noticed by Fisher. 
The 2 in the numerator comes from the fact that if $P(d\rho \mid H) \propto d\rho$,
$P(d\zeta \mid H) \propto \mathrm{sech}^2\zeta\,d\zeta$. It therefore only appears once and its effect
diminishes indefinitely as series are combined, but the extra $\tfrac12$ in (26)
comes from the likelihood and is repeated in (34) by every series.

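If $\sigma$ and $\tau$ in the correlation law are originally known, (4) will be
replaced by

$$P(d\rho \mid \theta,\sigma,\tau,H) \propto \frac{1}{(1-\rho^2)^{n/2}} \exp\left\{-\frac{n}{2(1-\rho^2)}\left(\frac{s^2}{\sigma^2} + \frac{t^2}{\tau^2} - \frac{2\rho rst}{\sigma\tau}\right)\right\} d\rho. \tag{35}$$

Thus $r$ is no longer a sufficient statistic for $\rho$; $s$ and $t$ are also relevant.
The maximum posterior density is given by

$$\rho^3 - \frac{rst}{\sigma\tau}\rho^2 + \left(\frac{s^2}{\sigma^2} + \frac{t^2}{\tau^2} - 1\right)\rho - \frac{rst}{\sigma\tau} = 0. \tag{36}$$

If $r$ is positive, this is negative for $\rho = 0$, and equal to

$$\frac{s^2}{\sigma^2} + \frac{t^2}{\tau^2} - \frac{2rst}{\sigma\tau} \tag{37}$$

for $\rho = +1$, and this is positive. For $\rho = r$ it is equal to

$$r\left(\frac{s^2}{\sigma^2} + \frac{t^2}{\tau^2}\right) - (1+r^2)\frac{rst}{\sigma\tau} - r(1-r^2), \tag{38}$$

which vanishes if $s = \sigma$, $t = \tau$. Thus if $s$ and $t$ reach their expectations,
$r$ remains the best estimate of $\rho$. But (38) is negative if $s/\sigma$ and $t/\tau$ are
very small, positive if they are large, and in the former case the best
estimate of $\rho$ will be larger, in the latter smaller than $r$. The reason is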
that if the scatters are unusually large it is evidence that too many 
large deviations have occurred; if there is a positive correlation at all and 
this is found in both variables, the most likely way for this to happen 
would be by way of an excess of deviations where x and y have the 
same sign, and the correlation in the sample would tend to be more 
than p. 





It is unusual in practice, however, for $\sigma$ and $\tau$ to be well enough known
for such supplementary information about $\rho$ to be of much use.


3.9. Invariance theory. If we have two laws according to which the
chances of a variable $x$ being less than a given value are $P$ and $P'$,
any of the quantities

$$I_m = \int \left|(dP')^{1/m} - (dP)^{1/m}\right|^m, \qquad J = \int \log\frac{dP'}{dP}\,d(P'-P) \tag{1}$$

has remarkable properties. They are supposed defined in the Stieltjes
manner, by taking $\delta P$, $\delta P'$ for the same interval of $x$, forming the
approximating sums, and then making the intervals of $x$ tend to zero,
and therefore may exist even if $P$ and $P'$ are discontinuous. They are
all invariant for all non-singular transformations of $x$ and of the
parameters in the laws; and they are all positive definite. They can be
extended immediately to joint distributions for several variables. They
can therefore be regarded as providing measures of the discrepancy
between two laws of chance. They are greatest if $\delta P$ vanishes in all
intervals where $\delta P'$ varies and conversely; then $I_m = 2$, $J = \infty$. They
take these extreme values also if $P$ varies continuously with $x$, and $P'$
varies only at isolated values of $x$. The quantities $I_2$ and $J$ are specially
interesting. Put $p_r = \delta P_r$, $p'_r = \delta P'_r$ for the interval $\delta x_r$. Let $p_r$
depend on a set of parameters $\alpha_i$ ($i = 1$ to $m$); and let $p'_r$ be the result
of changing $\alpha_i$ to $\alpha_i + \Delta\alpha_i$, where $\Delta\alpha_i$ is small. Then, if $p_r$ is differentiable
with respect to $\alpha_i$, we have to the second order, using the summation
convention with respect to $i$, $k$,

$$J = \sum_r (p'_r - p_r)\log\frac{p'_r}{p_r} \tag{2}$$

$$= g_{ik}\,\Delta\alpha_i\,\Delta\alpha_k, \tag{3}$$

where

$$g_{ik} = \sum_r \frac{1}{p_r}\frac{\partial p_r}{\partial\alpha_i}\frac{\partial p_r}{\partial\alpha_k}. \tag{4}$$

Also

$$I_2 = \sum_r \left(\sqrt{p'_r} - \sqrt{p_r}\right)^2 = \tfrac14\,g_{ik}\,\Delta\alpha_i\,\Delta\alpha_k \tag{5}$$


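to the same accuracy. Thus $J$ and $4I_2$ have the form of the square of
an element of distance in curvilinear coordinates. If we transform to
any other set of parameters $\alpha'_j$, $J$ and $4I_2$ are unaltered, and

$$J = g'_{jl}\,\Delta\alpha'_j\,\Delta\alpha'_l, \tag{6}$$

where

$$g'_{jl} = g_{ik}\frac{\partial\alpha_i}{\partial\alpha'_j}\frac{\partial\alpha_k}{\partial\alpha'_l}. \tag{7}$$

Then

$$\|g'_{jl}\| = \|g_{ik}\|\left|\frac{\partial\alpha_i}{\partial\alpha'_j}\right|^2. \tag{8}$$

But in the transformation of a multiple integral

$$d\alpha_1\,d\alpha_2\ldots d\alpha_m = \left|\frac{\partial\alpha_i}{\partial\alpha'_j}\right| d\alpha'_1\ldots d\alpha'_m \tag{9}$$

$$= \left(\frac{\|g'_{jl}\|}{\|g_{ik}\|}\right)^{1/2} d\alpha'_1\ldots d\alpha'_m. \tag{10}$$

Hence

$$\|g_{ik}\|^{1/2}\,d\alpha_1\ldots d\alpha_m = \|g'_{jl}\|^{1/2}\,d\alpha'_1\ldots d\alpha'_m. \tag{11}$$

This expression is therefore invariant for all non-singular
transformations of the parameters. It is not known whether any analogous forms
can be derived from $I_m$ if $m \neq 2$, but the form of $I_m$ is then usually
much more complicated.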

In consequence of this result, if we took the prior probability density
for the parameters to be proportional to $\|g_{ik}\|^{1/2}$, it could be stated for
any law that is differentiable with respect to all parameters in it, and
would have the property that the total probability in any region of
the $\alpha_i$ would be equal to the total probability in the corresponding
region of the $\alpha'_j$; in other words, it satisfies the rule that equivalent
propositions have the same probability. Consequently any arbitrariness
in the choice of the parameters could make no difference to the results,
and it is proved that for this wide class of laws a consistent theory of
probability can be constructed. Hence our initial requirement 2 (p. 8)
can be satisfied for this class; it remains to be seen whether the desirable,
but less precise or fundamental requirement 7 (p. 10) is also satisfied.

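For the normal law of error

$$P(dx \mid \lambda,\sigma,H) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{(x-\lambda)^2}{2\sigma^2}\right\}dx \tag{12}$$

we have exactly, if

$$\sigma'/\sigma = e^{\zeta}, \tag{13}$$

$$I_2 = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\left[\sigma'^{-1/2}\exp\left\{-\frac{(x-\lambda')^2}{4\sigma'^2}\right\} - \sigma^{-1/2}\exp\left\{-\frac{(x-\lambda)^2}{4\sigma^2}\right\}\right]^2 dx$$

$$= 2\left[1 - \left(\frac{2\sigma\sigma'}{\sigma^2+\sigma'^2}\right)^{1/2}\exp\left\{-\frac{(\lambda'-\lambda)^2}{4(\sigma^2+\sigma'^2)}\right\}\right]$$

$$= 2 - 2\,\mathrm{sech}^{1/2}\zeta\,\exp\left\{-\frac{(\lambda'-\lambda)^2}{4(\sigma^2+\sigma'^2)}\right\}, \tag{14}$$

$$J = \int_{-\infty}^{\infty}\log\frac{dP'}{dP}\,d(P'-P) = 2\sinh^2\zeta + \cosh\zeta\,\frac{(\lambda'-\lambda)^2}{\sigma\sigma'}. \tag{15}$$

To the second order

$$J = 4I_2 = \frac{(d\lambda)^2}{\sigma^2} + \frac{2(d\sigma)^2}{\sigma^2}. \tag{16}$$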

Three cases arise. If $\sigma$ is fixed, the coefficient of $(d\lambda)^2$ is constant,
giving a uniform prior probability distribution for $\lambda$ over the range
permitted, in accordance with the rule for a location parameter. If $\lambda$
is fixed, $\|g_{ik}\|^{1/2}d\sigma \propto d\sigma/\sigma$, again in accordance with the rule that we
have adopted. This rule, of course, has itself been chosen largely for
reasons of invariance under transformation of $\sigma$. But if $\lambda$ and $\sigma$ are
both varied, $\|g_{ik}\|^{1/2}d\lambda\,d\sigma \propto d\lambda\,d\sigma/\sigma^2$ instead of $d\lambda\,d\sigma/\sigma$. If the same
method was applied to a joint distribution for several variables about
independent true values, an extra factor $1/\sigma$ would appear for each.
The index in the corresponding $t$ distribution would always be $\tfrac12(n+1)$
however many true values were estimated. This is unacceptable. In
the usual situation in an estimation problem $\lambda$ and $\sigma$ are each capable
of any value over a considerable range, and neither gives any appreciable
information about the other. Then if we are given $-M < \lambda < M$,
$\sigma_1 < \sigma < \sigma_2$, we should take

$$P(d\lambda \mid H) = d\lambda/2M, \qquad P(d\sigma \mid H) = d\sigma/\sigma\log(\sigma_2/\sigma_1),$$

$$P(d\lambda\,d\sigma \mid H) = P(d\lambda \mid H)\,P(d\sigma \mid H) = \frac{d\lambda\,d\sigma}{2M\sigma\log(\sigma_2/\sigma_1)}. \tag{17}$$

The departure from the general rule is thus explicable as due to the 
use of a previous judgement of irrelevance. 
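
The exact value of $J$ for two normal laws in (15), and its second-order form (16), can be checked by direct quadrature. The sketch below is illustrative only: it assumes the definition $J = \int (f'-f)\log(f'/f)\,dx$ for densities $f$, $f'$, uses a simple midpoint rule, and the function names, integration range, and step count are arbitrary choices.

```python
import math

def normal_pdf(x, lam, sigma):
    """Density of the normal law with location lam and scale sigma."""
    return math.exp(-0.5 * ((x - lam) / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)

def log_ratio(x, lam, sigma, lam2, sigma2):
    """log(f'/f), written out analytically to avoid dividing tiny numbers."""
    return (math.log(sigma / sigma2)
            - 0.5 * ((x - lam2) / sigma2) ** 2
            + 0.5 * ((x - lam) / sigma) ** 2)

def j_numeric(lam, sigma, lam2, sigma2, half_width=12.0, steps=20000):
    """Midpoint-rule estimate of J = integral of (f' - f) log(f'/f) dx."""
    lo = min(lam, lam2) - half_width * max(sigma, sigma2)
    hi = max(lam, lam2) + half_width * max(sigma, sigma2)
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        diff = normal_pdf(x, lam2, sigma2) - normal_pdf(x, lam, sigma)
        total += diff * log_ratio(x, lam, sigma, lam2, sigma2)
    return total * h

def j_closed(lam, sigma, lam2, sigma2):
    """Closed form (15): 2 sinh^2(zeta) + cosh(zeta) (lam' - lam)^2 / (sigma sigma')."""
    zeta = math.log(sigma2 / sigma)
    return 2.0 * math.sinh(zeta) ** 2 + math.cosh(zeta) * (lam2 - lam) ** 2 / (sigma * sigma2)

def j_second_order(sigma, d_lam, d_sigma):
    """Second-order form (16): (d lam)^2 / sigma^2 + 2 (d sigma)^2 / sigma^2."""
    return (d_lam ** 2 + 2.0 * d_sigma ** 2) / sigma ** 2

if __name__ == "__main__":
    print(j_numeric(0.0, 1.0, 0.5, 1.5), j_closed(0.0, 1.0, 0.5, 1.5))
    print(j_closed(0.0, 1.0, 0.01, 1.02), j_second_order(1.0, 0.01, 0.02))
```

The determinant of the quadratic form (16) is $2/\sigma^4$, which is where the factor $1/\sigma^2$ for joint variation of $\lambda$ and $\sigma$ comes from.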

There is no trouble for $\sigma$ alone or $\lambda$ alone; it arises when they are
considered both at once. Now take a law such as that of partial
correlation

$$P(dx_1\ldots dx_m \mid \alpha_{ik},\sigma_i,H) = A\exp(-\tfrac12 W)\prod dx_i,$$

where

$$W = \sum \alpha_{ik}\frac{x_i x_k}{\sigma_i\sigma_k}$$

and the $x_i$ are a set of observables. Here for each $x_i$ there is a
corresponding scale parameter $\sigma_i$ and the $\alpha_{ik}$ are numerical coefficients. It is
clear from considerations of similarity that $J$, to the second order, is a
quadratic in $d\sigma_i/\sigma_i$, and that $\|g_{ik}\|^{1/2}$ will be of the form $B\prod_i 1/\sigma_i$, where
$B$ is a numerical factor depending on the $\alpha_{ik}$. Hence the rule leads to

$$P\left(d\sigma_1\ldots d\sigma_m\prod d\alpha_{ik} \mid H\right) \propto B\prod_i\frac{d\sigma_i}{\sigma_i}\prod d\alpha_{ik}, \tag{18}$$

which is what we should expect. There is no difficulty in the
introduction of any number of scale parameters.

We can then deal with location parameters, on the hypothesis that
the scale and numerical parameters are irrelevant to them, by simply
taking their prior probability uniform. If $\lambda$ and $\sigma$ are location and scale
parameters in general, and the numerical parameters are $\alpha_i$, we can take

$$P\left(d\lambda\,d\sigma\prod d\alpha_i \mid H\right) \propto d\lambda\,\|g_{ik}\|^{1/2}\,d\sigma\prod d\alpha_i, \tag{19}$$

where $g_{ik}$ is found by varying only $\sigma$ and the $\alpha_i$, and $\|g_{ik}\|^{1/2}$ is equal to
$1/\sigma$ times a function of the $\alpha_i$. This is invariant for transformations of
the form

$$\lambda' = \lambda + \sigma f(\alpha_i), \tag{20}$$

which is the only form of transformation of $\lambda$ that we should wish to
make.

If $\sigma$ is already uniquely defined, a satisfactory rule would be

$$P\left(d\lambda\,d\sigma\prod d\alpha_i \mid H\right) \propto d\lambda\,\frac{d\sigma}{\sigma}\,\|g_{ik}\|^{1/2}\prod d\alpha_i, \tag{21}$$

where $g_{ik}$ is now found by varying only the $\alpha_i$, keeping $\lambda$, $\sigma$ constant.

Again, take a Pearson Type I law $A(x-c_1)^{m_1}(c_2-x)^{m_2}dx$. For any
non-zero change of $c_1$ or $c_2$, $J$ is infinite. $I_2$ is not of the second order
in $\Delta c_1$, $\Delta c_2$ unless $m_1, m_2 > 1$. If we evaluate the coefficients in the
differential form by integration, e.g.

$$g_{c_1 c_1} = \int_{c_1}^{c_2}\frac{1}{p}\left(\frac{\partial p}{\partial c_1}\right)^2 dx, \qquad p = A(x-c_1)^{m_1}(c_2-x)^{m_2}. \tag{22}$$

This diverges unless $m_1 > 1$. Thus the general rule fails if the law is
not differentiable at a terminus. But the case where either of $m_1, m_2 < 1$
is precisely the case where a terminus can be estimated from $n$
observations with an uncertainty $o(n^{-1/2})$, and it is then advantageous to take
that terminus as a parameter explicitly; the occasion for transformation
of it no longer exists. If one of $m_1, m_2 \leq 1$ it is natural to take $c_1$ or $c_2$
respectively as location parameter; if both are $\leq 1$, it is equally natural
to take $\tfrac12(c_1+c_2)$ as location parameter and $\tfrac12(c_2-c_1)$ as scale parameter.
In either case we need only evaluate the differential form for changes of
the other parameters and find a prior probability for them independent
of $c_1$, $c_2$, or both, as the case may be. It is interesting to find that an
apparent failure of the general rule corresponds to a well-known
exceptional case in an estimation problem and that the properties of this case
themselves suggest the appropriate modification of the procedure.

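For the comparison of two chances $\alpha$, $\alpha'$ we have

$$I_2 = (\sqrt{\alpha'}-\sqrt{\alpha})^2 + \{\sqrt{(1-\alpha')}-\sqrt{(1-\alpha)}\}^2 = 2 - 2\sqrt{(\alpha\alpha')} - 2\sqrt{(1-\alpha)}\sqrt{(1-\alpha')}. \tag{23}$$

This takes a simple form if we put $\alpha = \sin^2 a$, $\alpha' = \sin^2 a'$;

$$I_2 = 4\sin^2\tfrac12(a'-a) \approx (a'-a)^2. \tag{24}$$

The exact form of $J$ is more complicated:

$$J = (\alpha'-\alpha)\log\frac{\alpha'(1-\alpha)}{\alpha(1-\alpha')}. \tag{25}$$

Then the rule (11) gives

$$P(d\alpha \mid H) = \frac{2}{\pi}\,da = \frac{1}{\pi}\frac{d\alpha}{\sqrt{\{\alpha(1-\alpha)\}}}. \tag{26}$$

This is an interesting form, because we have already had hints that
both the usual rule $d\alpha$ and Haldane's rule

$$P(d\alpha \mid H) \propto \frac{d\alpha}{\alpha(1-\alpha)}$$

are rather unsatisfactory, and that something intermediate would be
better.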
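
The relations (23) to (26) for two chances are easy to verify numerically. The following sketch is offered only as an illustration; the function names are hypothetical, and the check that $J \approx 4I_2$ for a small change is the second-order statement of (3) and (5) applied to this case.

```python
import math

def i2(alpha, alpha2):
    """I_2 of (23): sum of squared differences of square roots of the chances."""
    return ((math.sqrt(alpha2) - math.sqrt(alpha)) ** 2
            + (math.sqrt(1.0 - alpha2) - math.sqrt(1.0 - alpha)) ** 2)

def i2_angle(alpha, alpha2):
    """I_2 in the form (24), with alpha = sin^2(a) and alpha' = sin^2(a')."""
    a = math.asin(math.sqrt(alpha))
    a2 = math.asin(math.sqrt(alpha2))
    return 4.0 * math.sin(0.5 * (a2 - a)) ** 2

def j(alpha, alpha2):
    """J of (25)."""
    return (alpha2 - alpha) * math.log(alpha2 * (1.0 - alpha) / (alpha * (1.0 - alpha2)))

def prior_density(alpha):
    """Density of rule (26): 1 / (pi * sqrt(alpha (1 - alpha)))."""
    return 1.0 / (math.pi * math.sqrt(alpha * (1.0 - alpha)))

if __name__ == "__main__":
    print(i2(0.3, 0.35), i2_angle(0.3, 0.35))      # the two forms of I_2 agree
    print(j(0.3, 0.3001) / (4.0 * i2(0.3, 0.3001)))  # close to 1 for a small change
    print(prior_density(0.5))                       # 2/pi at alpha = 1/2
```

The density (26) lies between the uniform rule and Haldane's rule in the sense that its exponent of $\alpha(1-\alpha)$ is $-\tfrac12$, against 0 and $-1$ respectively.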

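For a set of chances $\alpha_r$ ($r = 1$ to $m$; $\sum\alpha_r = 1$) we find

$$I_2 = 2 - 2\sum\sqrt{\{\alpha_r(\alpha_r+\Delta\alpha_r)\}} = \frac14\sum_{r=1}^{m}\frac{(\Delta\alpha_r)^2}{\alpha_r} = \frac14\sum_{r=1}^{m-1}\frac{(\Delta\alpha_r)^2}{\alpha_r} + \frac{1}{4\alpha_m}\left(\sum_{r=1}^{m-1}\Delta\alpha_r\right)^2. \tag{27}$$

Then

$$\|g_{ik}\| = \frac{1}{\prod\alpha_r}, \tag{28}$$

$$P(d\alpha_1\ldots d\alpha_{m-1} \mid H) \propto \frac{d\alpha_1\ldots d\alpha_{m-1}}{\sqrt{(\prod\alpha_r)}}. \tag{29}$$

The rule so found is an appreciable modification of the rule for
multiple sampling given in 3.23, and is the natural extension of (26).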

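If $\phi_r$, $\psi_s$ are two sets of exhaustive and exclusive alternatives, $\phi_r$ being
irrelevant to $\psi_s$, with chances $\alpha_r$, $\beta_s$ ($r = 1$ to $m$, $s = 1$ to $n$), the chance
of $\phi_r\psi_s$ is $\alpha_r\beta_s$. If we vary both $\alpha_r$ and $\beta_s$ and consider $I_2$ and $J$ for the
changes of $\alpha_r\beta_s$, we get

$$I_2 = 2 - 2\sum\sum\sqrt{(\alpha_r\beta_s\alpha'_r\beta'_s)} = 2 - 2(1-\tfrac12 I_{2,\alpha})(1-\tfrac12 I_{2,\beta}), \tag{30}$$

$$J = \sum\sum(\alpha'_r\beta'_s-\alpha_r\beta_s)\log\frac{\alpha'_r\beta'_s}{\alpha_r\beta_s} = \sum(\alpha'_r-\alpha_r)\log\frac{\alpha'_r}{\alpha_r} + \sum(\beta'_s-\beta_s)\log\frac{\beta'_s}{\beta_s} = J_\alpha + J_\beta, \tag{31}$$

suffixes $\alpha$, $\beta$ indicating the values if $\alpha_r$, $\beta_s$ are varied separately. Hence
for probabilities expressible as products of chances $\log(1-\tfrac12 I_2)$ and $J$
have an exact additive property. The estimation rule then gives

$$P(d\alpha_1\ldots d\alpha_{m-1}\,d\beta_1\ldots d\beta_{n-1} \mid H) = P(d\alpha_1\ldots d\alpha_{m-1} \mid H)\,P(d\beta_1\ldots d\beta_{n-1} \mid H), \tag{32}$$

which is satisfactory.

Now consider a set of quantitative laws $\phi_r$ with chances $\alpha_r$. If $\phi_r$ is
true, the chance of a variable $x$ being in a range $dx$ is $f_r(x,\alpha_{r1},\ldots,\alpha_{rn})\,dx$,
and

$$P(\phi_r\,dx \mid \alpha_r,\alpha_{rs},H) = \alpha_r f_r(x,\alpha_{r1},\ldots,\alpha_{rn})\,dx. \tag{33}$$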

For variations of both the $\alpha_r$ and the $\alpha_{rs}$,

$$I_2 = 2 - 2\sum\sqrt{\{\alpha_r(\alpha_r+\Delta\alpha_r)\}}\int\sqrt{\{f_r(f_r+\Delta f_r)\}}\,dx$$

$$= 2 - \sum\sqrt{\{\alpha_r(\alpha_r+\Delta\alpha_r)\}}\,(2-I_{2,r})$$

$$= I_{2,\alpha} + \sum\sqrt{\{\alpha_r(\alpha_r+\Delta\alpha_r)\}}\,I_{2,r} \tag{34}$$

and, to the second order,

$$I_2 = I_{2,\alpha} + \sum\alpha_r I_{2,r}. \tag{35}$$

$I_{2,r}$ is the discrepancy between $f_r$ with parameters $\alpha_{rs}$ and with
parameters $\alpha_{rs}+\Delta\alpha_{rs}$. If we form $\|g_{ik}\|$ for variations of all $\alpha_r$ and all $\alpha_{rs}$,
the rule will then give the same factor depending on the $\alpha_{rs}$ as for
estimation of $\alpha_{rs}$ when $\phi_r$ is taken as certain. But for every $\alpha_{rs}$ a factor
$\alpha_r^{1/2}$ will enter into $\|g_{ik}\|^{1/2}$, and will persist on integration with regard
to the $\alpha_{rs}$. Hence the use of the rule for all $\alpha_r$ and all $\alpha_{rs}$ simultaneously
would lead to a change of the prior probability of $\alpha_r$ for every parameter
contained in $f_r$. This would not be inconsistent, but as for scale
parameters it is not the usual practical case. $\alpha_r$ is ordinarily determined
only by the conditions of sampling and has nothing to do with the
complexity of the $f_r$. To express this, we need a modification analogous
to that used for location parameters; the chance $\alpha_r$, like a location
parameter, must be put in a privileged position, and we have to
consider what type of invariance can hold for it.





The general form (11) gives invariance for the most general
non-singular transformations of the parameters. In this problem it would
permit the use of a set of parameters that might be any independent
functions of both the $\alpha_r$ and the $\alpha_{rs}$. In sampling for discrete
alternatives it is not obvious that there is any need to consider transformations
of the chances at all.

If we take

$$P\left(\prod d\alpha_r\prod d\alpha_{rs} \mid H\right) \propto \frac{\prod_{r=1}^{m-1}d\alpha_r}{\sqrt{(\prod\alpha_r)}}\,\prod_{r=1}^{m}\|g_{ik,r}\|^{1/2}\prod_s d\alpha_{rs}, \tag{36}$$

where $g_{ik,r}$ is based on comparison of $f_r$ with $f_r+\Delta f_r$, we shall still
have invariance for all transformations of the $\alpha_r$ among themselves and
of the $\alpha_{rs}$ among themselves, and this is adequate. If we do not require
to consider transformations of the $\alpha_r$ we do not need the factor $(\prod\alpha_r)^{-1/2}$.
If some of the $\alpha_{rs}$ are location and scale parameters, we can use the
modification (19). (36) can then be regarded as the appropriate
extension of (32), which represents the case where $\alpha_{rs} = \beta_s$, independent of $r$.

For the Poisson law

$$P(m \mid r,H) = e^{-r}\frac{r^m}{m!} \tag{37}$$

we find

$$I_2 = 2 - 2\exp\{-\tfrac12(\sqrt{r'}-\sqrt{r})^2\}, \qquad J = (r'-r)\log(r'/r), \tag{38}$$

leading to

$$P(dr \mid H) \propto dr/\sqrt r. \tag{39}$$

This conflicts with the rule $dr/r$ used in 3.3, which was quite
satisfactory. The Poisson parameter, however, is in rather a special position.
It is usually the product of a scale factor with an arbitrary sample size,
which is not chosen until we already have some information about the
probable range of values of the scale parameter. It does, however,
point a warning for all designed experiments. The whole point of general
rules for the prior probability is to give a starting-point, which we take
to represent previous ignorance. They will not be correct if previous
knowledge is being used, whether it is explicitly stated or not. In the
case of the Poisson law the sample size is chosen so that $r$ will be a
moderate number, usually 1 to 10; we should not take it so that the chance of
the event happening at all is very small. The $dr/r$ rule, in fact, may
express complete ignorance of the scale parameter; but $dr/\sqrt r$ may
express just enough information to suggest that the experiment is
worth making. Even if we used (39), the posterior probability density
after one observation would be integrable over all $r$.
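
The closed forms for $I_2$ and $J$ in the Poisson case can be compared with direct summation over $m$. The sketch below is illustrative only: the truncation point of the sums and the function names are assumptions, and the sums are taken straight from the discrete definitions $I_2 = \sum(\sqrt{p'_m}-\sqrt{p_m})^2$ and $J = \sum(p'_m-p_m)\log(p'_m/p_m)$.

```python
import math

def poisson_pmf(m, r):
    """Poisson chance e^{-r} r^m / m!, computed through logarithms."""
    return math.exp(-r + m * math.log(r) - math.lgamma(m + 1))

def i2_sum(r, r2, mmax=200):
    """I_2 by direct summation over m = 0, ..., mmax - 1."""
    return sum((math.sqrt(poisson_pmf(m, r2)) - math.sqrt(poisson_pmf(m, r))) ** 2
               for m in range(mmax))

def j_sum(r, r2, mmax=200):
    """J by direct summation; the log ratio is written out analytically."""
    total = 0.0
    for m in range(mmax):
        log_ratio = (r - r2) + m * math.log(r2 / r)
        total += (poisson_pmf(m, r2) - poisson_pmf(m, r)) * log_ratio
    return total

def i2_closed(r, r2):
    """Closed form 2 - 2 exp{-(sqrt(r') - sqrt(r))^2 / 2}."""
    return 2.0 - 2.0 * math.exp(-0.5 * (math.sqrt(r2) - math.sqrt(r)) ** 2)

def j_closed(r, r2):
    """Closed form (r' - r) log(r'/r)."""
    return (r2 - r) * math.log(r2 / r)

if __name__ == "__main__":
    print(i2_sum(3.0, 4.0), i2_closed(3.0, 4.0))
    print(j_sum(3.0, 4.0), j_closed(3.0, 4.0))
```

For a small change $dr$ both forms give $J \approx 4I_2 \approx (dr)^2/r$, so that $\|g\|^{1/2} \propto r^{-1/2}$, which is the origin of (39).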






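For normal correlation we get

$$J = -2 + \frac{\sigma^2/\sigma'^2 + \tau^2/\tau'^2 - 2\rho\rho'\sigma\tau/\sigma'\tau'}{2(1-\rho'^2)} + \frac{\sigma'^2/\sigma^2 + \tau'^2/\tau^2 - 2\rho\rho'\sigma'\tau'/\sigma\tau}{2(1-\rho^2)}, \tag{40}$$

$$I_2 = 2 - 4(\sigma\sigma'\tau\tau')^{1/2}\{(1-\rho^2)(1-\rho'^2)\}^{1/4} \times \{\sigma'^2\tau'^2(1-\rho'^2) + \sigma^2\tau^2(1-\rho^2) + \sigma^2\tau'^2 + \sigma'^2\tau^2 - 2\rho\rho'\sigma\sigma'\tau\tau'\}^{-1/2}. \tag{41}$$

If we put

$$\sigma' = \sigma e^{2u}, \quad \tau' = \tau e^{2v}, \quad \rho = \tanh\zeta, \quad \rho' = \tanh\zeta' \tag{42}$$

and change the parameters to $u+v$, $u-v$, we get, to the second order
in $u$, $v$, $\zeta'-\zeta$,

$$J = (1+\tanh^2\zeta)(\zeta'-\zeta)^2 - 4\tanh\zeta\,(\zeta'-\zeta)(u+v) + 4(u+v)^2 + 4(u-v)^2\cosh^2\zeta, \tag{43}$$

$$\|g_{ik}\| = 64\cosh^2\zeta, \tag{44}$$

$$P(d\sigma\,d\tau\,d\rho \mid H) \propto \cosh\zeta\,d\zeta\,du\,dv \propto \frac{d\sigma\,d\tau\,d\rho}{\sigma\tau(1-\rho^2)^{3/2}}. \tag{45}$$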

The modifications of the analysis of 3.8, when this rule is adopted,
are straightforward. The divergence at $\rho = \pm 1$ is a new feature, and
persists if there is one observation, when $r$ is $\pm 1$. If there are two
observations and $r \neq \pm 1$ the posterior probability density for $\rho$ has a
convergent integral, so that the rule gives intelligible answers when the
data have anything useful to say.

In problems concerned with correlations the results will depend
somewhat on the choice of parameters in defining $J$. From (43) we can
write $J$ for small variations as

$$J = (\zeta'-\zeta)^2 + 4\cosh^2\zeta\,(u-v)^2 + \{2(u+v) - \tanh\zeta\,(\zeta'-\zeta)\}^2. \tag{46}$$

Now $\sigma$ and $\tau$ can be regarded as parameters defined irrespectively
of $\rho$; for whatever $\rho$ may be, the probability distributions of $x$ and $y$
separately are normal with standard errors $\sigma$, $\tau$. Thus we may analyse
the estimation of a correlation into three parts: what is the probability
distribution of $x$? what is that of $y$? and given those of $x$ and $y$
separately, does the variation of $y$ depend on that of $x$, and conversely?
In this analysis we are restricted to a particular order of testing and
in giving the prior probability of $\rho$ we should evaluate $J$ with $\sigma$ and $\tau$
fixed. In this case (40) becomes

$$J = \frac{(1+\rho\rho')(\rho-\rho')^2}{(1-\rho^2)(1-\rho'^2)} \tag{47}$$

and

$$P(d\rho \mid \sigma\tau H) \propto \frac{\sqrt{(1+\rho^2)}}{1-\rho^2}\,d\rho. \tag{48}$$





From the interpretation of a correlation coefficient in terms of a
chance (2.5) we should have expected

$$P(d\rho \mid \sigma\tau H) = \frac{1}{\pi}\frac{d\rho}{\sqrt{(1-\rho^2)}}. \tag{49}$$

This is integrable as it stands and would be free from objection in any
case where the model considered in 2.3 is known to be representative
of the physics of the problem.

The different rules for $\rho$ correspond to rather different requirements.
(45) contemplates transformations of $\rho$, $\sigma$, $\tau$ together, (48)
transformations only of $\rho$, keeping $\sigma$, $\tau$ fixed. (49) does not contemplate
transformations at all, but appeals to a model. But the rule for this model
itself is derived by considering transformations of a simple chance, and
the need for this is not obvious. We really cannot say that any of these
rules is better than the uniform distribution adopted in 3.8.

These rules do not cover the sampling of a finite population. The
possible numbers of one type are then all integers and differentiation
is impossible. This difficulty does not appear insuperable. Suppose
that the population is of number $n$ and contains $r$ members with the
property. Treat this as a sample of $n$ derived from a chance $\alpha$. Then

$$P(d\alpha \mid nH) = \frac{d\alpha}{\pi\sqrt{\{\alpha(1-\alpha)\}}},$$

$$P(r \mid n,\alpha H) = \frac{n!}{r!\,(n-r)!}\,\alpha^r(1-\alpha)^{n-r},$$

$$P(r\,d\alpha \mid nH) = \frac{n!}{\pi\,r!\,(n-r)!}\,\alpha^{r-1/2}(1-\alpha)^{n-r-1/2}\,d\alpha,$$

$$P(r \mid nH) = \frac{(r-\frac12)!\,(n-r-\frac12)!}{\pi\,r!\,(n-r)!}. \tag{50}$$

This is finite both for $r = 0$ and $r = n$.
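
As a check on (50), the probabilities $P(r \mid nH)$ should sum to unity over $r = 0, 1, \ldots, n$. The following sketch evaluates them through the gamma function, reading $(r-\tfrac12)!$ as $\Gamma(r+\tfrac12)$; the function name is illustrative and not from the text.

```python
import math

def prob_r(r, n):
    """P(r | n H) of (50): (r - 1/2)! (n - r - 1/2)! / (pi r! (n - r)!)."""
    log_p = (math.lgamma(r + 0.5) + math.lgamma(n - r + 0.5)
             - math.lgamma(r + 1) - math.lgamma(n - r + 1)
             - math.log(math.pi))
    return math.exp(log_p)

if __name__ == "__main__":
    n = 10
    probs = [prob_r(r, n) for r in range(n + 1)]
    for r, p in enumerate(probs):
        print(r, round(p, 4))
    print("total:", sum(probs))
```

The distribution is symmetrical about $r = \tfrac12 n$ and gives its largest values to the extremes $r = 0$ and $r = n$, which remain finite as stated.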

To sum up the results found so far; 

1. A widely applicable rule is available for assessing the prior
probability in estimation problems and will satisfy the requirement of
consistency whenever it can be applied, in the sense that it is applicable
under any non-singular transformation of the parameters, and will lead
to equivalent results. At least this proves the possibility of a consistent
theory of induction, covering a large part of the subject.

2. There are many cases where the rule, though consistent, leads to
results that appear to differ too far from current practice, but it is still
possible to use modified forms of the rule which actually have a wider
applicability. These cases are associated with conditions where there
is reason to take the prior probabilities of some of the parameters as
independent of one another.

3. The rule is not applicable to laws that are not differentiable with 
regard to all parameters in them; but in this case a modification of the 
rule is often satisfactory. 

4. In some cases where the parameters themselves can take only 
discrete values, an extension of the rule is possible. 

Further investigation is desirable; there may be some other method 
that would preserve or even extend the generality of the one just 
discussed, while dealing with some of the awkward cases more directly. 



IV 

APPROXIMATE METHODS AND SIMPLIFICATIONS 


‘Troll, to thyself be true — enough.’ 


Ibsen, Peer Gynt.


4.0. Maximum likelihood. If a law containing parameters $\alpha, \beta, \gamma, \ldots$
and a set of observations $\theta$ lead to the likelihood function $L(\alpha,\beta,\gamma,\ldots)$,
and if the prior probability is

$$P(d\alpha\,d\beta\,d\gamma\ldots \mid H) \propto f(\alpha,\beta,\gamma,\ldots)\,d\alpha\,d\beta\,d\gamma\ldots, \tag{1}$$

then

$$P(d\alpha\,d\beta\,d\gamma\ldots \mid \theta H) \propto fL\,d\alpha\,d\beta\,d\gamma\ldots. \tag{2}$$

There will in general be a set of values of $\alpha, \beta, \gamma, \ldots$, say $a, b, c, \ldots$, that make
$L$ a maximum. These may be called the 'maximum likelihood solution'.
Then if we put $\alpha = a+\alpha'$, and so on, we can usually expand $\log f$ and
$\log L$ in powers of $\alpha', \beta', \gamma', \ldots$. Now the maximum posterior probability
density is given by

$$\frac{1}{L}\frac{\partial L}{\partial\alpha} + \frac{1}{f}\frac{\partial f}{\partial\alpha} = 0 \tag{3}$$

with similar equations. The prior probability function $f$ is independent
of $n$, the number of observations; $\log L$ in general increases like $n$.
Hence if $(\alpha', \beta', \gamma', \ldots)$ satisfy (3), they will be of order $1/n$.

Also, if we neglect terms of order above the second in $\log L$ and $\log f$,
the second derivatives of $\log Lf$ will contain terms of order $n$ from $\log L$,
while those from $\log f$ do not increase. Hence for $\alpha', \beta', \gamma', \ldots$ small, the
quadratic terms will be

$$-n\phi_2(\alpha',\beta',\gamma',\ldots) + O(\alpha'^2,\beta'^2,\gamma'^2,\ldots), \tag{4}$$

where $\phi_2$ is a positive quadratic form independent of $n$. Hence the
posterior probability is concentrated in ranges of order $n^{-1/2}$ and this
indicates the uncertainty of any possible estimates of $\alpha, \beta, \gamma, \ldots$. But
the differences between the values that make the likelihood and the
posterior density maxima are only of order $1/n$. Hence if the number of
observations is large, the error committed by taking the maximum
likelihood solution as the estimate is less than the uncertainty inevitable
in any case. Further, the terms in $\log Lf$ that come from $L$ are of
order $n$ times those from $f$, and hence if we simply take the posterior
density proportional to $L$ we shall get the right uncertainties within
factors of order $1/n$. Thus the errors introduced by treating the prior
probability as uniform will be of no practical importance if the number
of observations is large.
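
The orders of magnitude in this argument can be illustrated with a simple case. The sketch below assumes, purely for illustration, a sample of $n$ trials with $r$ successes and the prior density proportional to $\{\alpha(1-\alpha)\}^{-1/2}$ of 3.9 (26); the posterior density is then proportional to $\alpha^{r-1/2}(1-\alpha)^{n-r-1/2}$, with its maximum at $(r-\tfrac12)/(n-1)$, while the maximum likelihood solution is $r/n$. The function names and the chosen fraction are hypothetical.

```python
import math

def mle(r, n):
    """Maximum likelihood solution for a chance: r/n."""
    return r / n

def posterior_mode(r, n):
    """Mode of alpha^(r - 1/2) (1 - alpha)^(n - r - 1/2), i.e. (r - 1/2)/(n - 1)."""
    return (r - 0.5) / (n - 1.0)

def standard_error(r, n):
    """Usual large-sample uncertainty of a chance, of order n^(-1/2)."""
    a = r / n
    return math.sqrt(a * (1.0 - a) / n)

def compare(n, fraction=0.3):
    """Return (shift, standard error, ratio) for r = fraction * n."""
    r = fraction * n
    shift = abs(mle(r, n) - posterior_mode(r, n))
    se = standard_error(r, n)
    return shift, se, shift / se

if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        shift, se, ratio = compare(n)
        print(n, shift, se, ratio)
```

The shift between the two maxima falls off like $1/n$ and the uncertainty like $n^{-1/2}$, so their ratio diminishes like $n^{-1/2}$ as the number of observations increases.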

The method of maximum likelihood has been vigorously advocated 





by Fisher; the above argument shows that in the great bulk of cases 
its results are indistinguishable from those given by the principle of 
inverse probability, which supplies a justification of it. An accurate 
statement of the prior probability is not necessary in a pure problem of 
estimation when the number of observations is large. What the result 
amounts to is that unless we previously know so much about the 
parameters that the observations can tell us little more, we may as 
well use the prior probability distribution that expresses ignorance of 
their values; and in cases where this distribution is not yet known there 
is no harm in taking a uniform distribution for any parameter that 
cannot be infinite. The difference made by any ordinary change of 
the prior probability is comparable with the effect of one extra obser- 
vation. 

Even where the uncertainty is of order $1/n$ instead of $n^{-1/2}$ this may
still be true. Thus for the rectangular distribution we had $L \propto \sigma^{-n}$,
while $Lf \propto \sigma^{-(n+1)}$. The differences between the ranges for a given
probability that the quantity lies within them, obtained by using $L$
instead of $Lf$, will be of order $1/n$ of the ranges themselves.

4.01. Relation of maximum likelihood to invariance theory.
Another important consequence of (1) and (3) is as follows. In 4.0 (2)
we have, taking the case of three unknowns,

$$P(\theta \mid \alpha\beta\gamma H) \propto L,$$

where $L$ depends on the observations and on $\alpha, \beta, \gamma$. $a, b, c$ are the values
of $\alpha, \beta, \gamma$ that make $L$ a maximum, the observations being kept the
same. Then for given $\alpha, \beta, \gamma$ we can find by integration a probability
that $a, b, c$ lie in given intervals $da, db, dc$. This does not assume that
$a, b, c$ are sufficient statistics. Then when $n$ is large $L$ is nearly
proportional to

$$\exp\{-\tfrac12 n g_{ik}(\alpha_i-a_i)(\alpha_k-a_k)\}\prod da_i$$

and all parameters given by maximum likelihood tend to become
sufficient statistics. Further, the constant factor is $(n/2\pi)^{m/2}\|g_{ik}\|^{1/2}$,
where $m$ is the number of parameters, and it is of trivial importance
whether $g_{ik}$ is evaluated for the actual values $\alpha_i$ or for $\alpha_i = a_i$. Hence
if we use $\|g_{ik}\|^{1/2}$ for the prior probability density, the probability
distribution of $\alpha_i - a_i$ is nearly the same when $n$ is large, whether it is
taken on data $\alpha_i$ or $a_i$; this is irrespective of the actual value of $\alpha_i$.

Mr. P. H. Diananda has suggested, on this account, that we could
state an invariance rule for the prior probability in estimation problems
as follows. Take, for $n$ large,

$$P(\alpha_i < a_i < \alpha_i+d\alpha_i \mid \alpha_i H) = f(\alpha_i)\prod d\alpha_i,$$

where $i$ covers all parameters in the law; then if we take

$$P(d\alpha_i \mid H) \propto f(\alpha_i)\prod d\alpha_i$$

we have a rule equivalent to the $\|g_{ik}\|^{1/2}$ rule where the latter is
applicable. It also works for the rectangular distribution. A similar rule was
given independently by Mr. Wilfred Perks, who, however, considered
only one parameter.†

Again, in the argument of 3.9 we considered only the values of the
invariants for one observation, except that we showed that for sets of
observations derived independently from the laws $J$ and $\log(1-\tfrac12 I_2)$
have an additive property. This argument is no longer applicable if the
observations are not derived independently; this happens in problems
where the law predicts something about the order of occurrence as well
as about their actual values. But it now appears that we can
consistently extend the rule to cover such cases. If two laws give

$$P(\theta \mid \alpha_i H) = L(\theta,\alpha_i); \qquad P(\theta \mid \alpha'_i H) = L(\theta,\alpha'_i),$$

we can take

$$J = \lim\frac{1}{n}\sum(L'-L)\log\frac{L'}{L},$$

$$-\log(1-\tfrac12 I_2) = -\lim\frac{1}{n}\log\sum(LL')^{1/2},$$

summations being over the possible values of $\theta$. Both reduce correctly
when the observations are derived independently.

4.1. An approximation to maximum likelihood. In all the problems 
considered in the last chapter sets of sufficient statistics exist. This is 
far from being a general rule. It fails indeed for such a simple form as 
the Cauchy law 

$$P(dx \mid \alpha,\sigma,H) = \frac{\sigma\,dx}{\pi\{\sigma^2+(x-\alpha)^2\}}.$$

If we have $n$ observations the likelihood is not capable of being
expressed in terms of the unknowns $\alpha$, $\sigma$ and any two functions of the
observed values of the $x$'s. For most of the Pearson laws there are no
sufficient statistics. The method of maximum likelihood is applicable
to such cases, but is liable to be very laborious, since $\log L$ must be
worked out numerically for at least three trial values of each parameter
so that its second derivatives can be found. The result has been, to a
very large extent, that where sufficient statistics do not exist for the
actual law, it is replaced by one for which they do exist, and information
is sacrificed for the sake of ease of manipulation. There is a definite need,
therefore, for a convenient approximate method that will not lose much
of the accuracy given by maximum likelihood but will be reasonably
expeditious.

† J. Inst. Actuaries, 1947, 1-28.
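The labour of working out $\log L$ for three trial values can be seen in a small sketch. It is only an illustration: it assumes that the scale of the Cauchy law is already known, so that only the location is sought, and the function names and the data in the usage note are hypothetical. A parabola is passed through $\log L$ at three equally spaced trial values; its vertex gives the revised estimate and its curvature the standard error.

```python
import math

def cauchy_log_likelihood(location, scale, xs):
    """log L for observations xs on the Cauchy law with given location and scale."""
    return sum(
        math.log(scale / (math.pi * (scale**2 + (x - location) ** 2)))
        for x in xs
    )

def parabolic_step(xs, scale, trial, step):
    """Revise a trial location from log L at trial - step, trial, trial + step.

    Assumes the scale is known; returns (revised estimate, standard error).
    """
    f_minus = cauchy_log_likelihood(trial - step, scale, xs)
    f_zero = cauchy_log_likelihood(trial, scale, xs)
    f_plus = cauchy_log_likelihood(trial + step, scale, xs)
    first = (f_plus - f_minus) / (2.0 * step)               # slope of log L
    second = (f_plus - 2.0 * f_zero + f_minus) / step**2    # curvature of log L
    if second >= 0:
        raise ValueError("log L is not concave near the trial value")
    estimate = trial - first / second        # vertex of the fitted parabola
    std_error = 1.0 / math.sqrt(-second)     # from the second derivative
    return estimate, std_error
```

For a made-up symmetrical set such as `[-3, -1, -0.5, 0, 0.5, 1, 3]` with unit scale, `parabolic_step(xs, 1.0, 0.0, 0.2)` leaves the trial value 0 unchanged apart from rounding, as symmetry requires. With two unknowns the same work must be repeated over a grid of trial values, which is the laboriousness referred to.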

In practice, with almost every method, observations are grouped by 
ranges of the argument before treatment. Thus effectively the data 
are not the individual observations but the numbers in assigned groups. 
Suppose then that the number in a group is $n_r$, and the total number $N$.
According to the law to be found the expectation in the group is $m_r$, and

$$\sum m_r = N. \tag{1}$$

$m_r/N$ is the chance, according to the law, that an observation will fall
in the $r$th group, and is a calculable function of the parameters in the
law. Then the joint chance of $n_1$ observations in the first group, $n_2$ in
the second, and so on, is

$$L = \frac{N!}{\prod n_r!}\prod\left(\frac{m_r}{N}\right)^{n_r}. \tag{2}$$

The $m_r$ are the only unknown quantities in this expression, and only
the last factor involves their variations. Now put

$$m_r = n_r + \alpha_r N^{1/2}, \tag{3}$$

where $|\alpha_r| N^{1/2} < n_r$, and where

$$\sum \alpha_r = 0. \tag{4}$$

Then

$$\begin{aligned}
\log L &= \text{constant} + \sum n_r \log\left(1+\frac{\alpha_r N^{1/2}}{n_r}\right)\\
&= \text{constant} + \sum n_r\left(\frac{\alpha_r N^{1/2}}{n_r}-\frac{\alpha_r^2 N}{2n_r^2}+\dots\right)\\
&= \text{constant} - \tfrac12\sum\frac{N\alpha_r^2}{n_r}\\
&= \text{constant} - \tfrac12\sum\frac{(m_r-n_r)^2}{n_r},
\end{aligned} \tag{5}$$

since the first order terms cancel by (4). Hence, apart from an irrelevant
constant, we have

$$\chi'^2 = -2\log L = \sum\frac{(m_r-n_r)^2}{n_r}. \tag{6}$$

$\chi'^2$ differs from Pearson's $\chi^2$ in having $n_r$ in the denominator
instead of $m_r$. The difference will be of order $(m_r-n_r)^3/n_r^2$, which is of
the order of the cubic terms neglected in both approximations. But
this form has the advantage that the $n_r$ are known, while the $m_r$ are
not. We can write the observed frequencies as equations of condition

$$m_r = n_r \pm \sqrt{n_r} \tag{7}$$




and then solve for the parameters in $m_r$ by the method of least squares,
with known weights. Pearson's form is equivalent to this accuracy (it
is itself an approximation to $-2\log L$, apart from a constant) but
would require successive approximation in actual use on account of the
apparent need to revise the $m_r$ at each approximation. It does not
appear that minimum $\chi^2$ has actually been much used in practice,
possibly for this reason. There are some references in the literature to
the fitting of frequencies by 'least squares', but the weights to be used
are not stated and it is not clear that minimum $\chi^2$ is meant. The errors
due to treating all values of $n_r$ as having the same accuracy would be
serious. The present form was given by Dr. J. Neyman† and
rediscovered by myself,‡ Neyman's paper having apparently attracted little
attention in this country. The great difficulty in calculating $\log L$
completely is that it usually requires the retention of a large number
of figures; in actual cases $\log_{10} L$ may be $-200$ to $-600$, and to find the
standard errors to two figures requires that the second decimal should
be correct. But in this method most of $\log L$ is absorbed into the
irrelevant additive constant, and we have only to calculate the changes
of the $m_r$, given $N$, for a set of given small changes of the parameters.
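As a minimal sketch of the least-squares solution with known weights, suppose (purely for illustration) that the chance of the $r$th group is linear in a single parameter, $p_r = a_r + \theta b_r$ with $\sum a_r = 1$ and $\sum b_r = 0$. The equations of condition $N(a_r + \theta b_r) = n_r \pm \sqrt{n_r}$ then give one normal equation. The law, the function name, and the numbers in the usage note are assumptions made for the example.

```python
def fit_linear_law(counts, base, slope):
    """Minimum chi'^2 estimate of theta for group chances base_r + theta*slope_r.

    Each equation of condition N*(base_r + theta*slope_r) = n_r has weight 1/n_r.
    Returns (theta, standard error, chi'^2 at the solution).
    """
    if any(n <= 0 for n in counts):
        raise ValueError("every group must be occupied; regroup empty groups first")
    total = sum(counts)
    # Coefficient of theta in the single normal equation.
    normal = sum((total * b) ** 2 / n for b, n in zip(slope, counts))
    # Right-hand side of the normal equation.
    right = sum(total * b * (n - total * a) / n
                for a, b, n in zip(base, slope, counts))
    theta = right / normal
    std_error = normal ** -0.5   # since chi'^2 stands for -2 log L
    chi2 = sum((total * (a + theta * b) - n) ** 2 / n
               for a, b, n in zip(base, slope, counts))
    return theta, std_error, chi2
```

With hypothetical counts `[40, 80, 120, 160]`, `base = [0.25] * 4` and `slope = [-0.3, -0.1, 0.1, 0.3]`, the counts agree exactly with $\theta = 0.5$, and the function returns that value with $\chi'^2$ zero to rounding. For a law that is not linear in its parameters the same step would be applied to the small changes of the $m_r$ about a trial solution.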

The method fails if any of the $n_r$ are zero, and is questionable if any
of them are 1. For unit groups there appears to be no harm in writing

$$m_r = 1 \pm 1 \tag{8}$$

because if a parameter depends on a single unit group it will be
uncertain by its full amount in any case; while if it depends on $p$ unit
groups the equations derived by using (8) for each can be summarized by

$$\sum m_r = p \pm \sqrt{p}, \tag{9}$$

which is right. But special attention is needed for empty groups.

† Bull. Inst. Intern. de Statistique, Warsaw, pp. 44-86 (1929).
‡ Proc. Camb. Phil. Soc. 34, 1938, 166-7.

Referring to (2) we see that if $n_r = 0$, $(m_r/N)^{n_r} = 1$ for all values of $m_r$.
If $M$ is the sum of the values of $m_r$ over the empty groups, we can still
make the substitution (3), but we shall now have

$$N^{1/2}\sum\alpha_r = -M, \tag{10}$$

$$\log L = \text{constant} - \tfrac12\sum\frac{(m_r-n_r)^2}{n_r} - M, \tag{11}$$

where the summations are now over the occupied groups. Hence if
there are empty groups we can take

$$\chi'^2 = \sum\frac{(m_r-n_r)^2}{n_r} + 2M, \tag{12}$$

the summation being over the occupied groups, and $M$ being the total
expectation according to the law in the empty groups. The term $-M$ in
$\log L$ corresponds to the probability $e^{-M}$ for a zero result according to the
Poisson law. This form does not lend itself to immediate solution by least
squares. In practice, with laws that give a straggling tail of scattered
observations with some empty groups, it is enough to group them so
that there are no empty groups, $m_r$ for a terminal group being calculated
for a range extending to infinity. Then (7) can always be used.†

4.2. Least square equations: successive approximation. It often
happens that a large number of the coefficients in the normal equations
are small or zero. In the extreme case, where all coefficients not in the
leading diagonal vanish, the equations are said to be orthogonal. In
the other extreme, where the determinant of the coefficients vanishes,
the solution is indeterminate, at least one unknown being capable of
being assigned arbitrarily. In all intermediate cases the determinant is
less than the product of the diagonal elements; if it is much less, the
solution may be called badly determined. The solution can, in theory,
always be completed on the lines of 3.5, but it often happens that there
are, effectively, so many unknowns that it is desirable to do the work
piecemeal. Two methods of successive approximation are often suitable.

Consider the form 

$$2W = b_{11}x_1^2 + 2b_{12}x_1x_2 + b_{22}x_2^2 + \dots - 2d_1x_1 - 2d_2x_2 - \dots + e, \tag{1}$$

and the normal equations

$$b_{11}x_1 + b_{12}x_2 + \dots = d_1, \tag{2}$$

$$b_{12}x_1 + b_{22}x_2 + \dots = d_2. \tag{3}$$

We can proceed by the following method, due to von Seidel. In (2)
neglect all terms in $x_2, \dots$ and take, therefore, $x_1 = d_1/b_{11}$. Now if all
the $x$'s are 0, $2W = e$. If we take $x_1 = d_1/b_{11}$ and all the others 0,

$$2W = e - \frac{d_1^2}{b_{11}}, \tag{4}$$

so that this substitution always reduces $W$. Now make this substitution
in (3) and neglect $x_3, x_4, \dots$. Then we have the approximation

$$b_{22}x_2 = d_2 - b_{12}x_1 \tag{5}$$

and $W$ is reduced by a further amount

$$\frac{(d_2 - b_{12}x_1)^2}{2b_{22}}. \tag{6}$$

† For numerical illustrations see Ann. Eugen. 11, 1941, 108-14.

So we may proceed, substituting in each equation the approximations
already found. On reaching the end we begin again at the first equation,
using the first approximations for $x_2, x_3, \dots$. Since $W$ is diminished each
time the process must converge, and often does so very rapidly. An
analogous method has been given recently by R. V. Southwell and
A. N. Black under the name of the progressive relaxation of constraints,
from an analogy with problems of elasticity.†

The following method is sometimes quicker but does not necessarily 
converge. Begin by transferring all terms of the normal equations to 
the right side, except the diagonal terms, thus: 

$$b_{11}x_1 = d_1 - b_{12}x_2 - b_{13}x_3 - \dots, \tag{7}$$

$$b_{22}x_2 = d_2 - b_{12}x_1 - b_{23}x_3 - \dots. \tag{8}$$

The first approximations are $x_1 = d_1/b_{11}$, $x_2 = d_2/b_{22}$, ... .

Substitute on the right to obtain a second approximation, and proceed.
Failure of the method will be indicated by failure of the approximations
to tend to a limit. In both methods it is a saving of trouble to make
a preliminary table of all the ratios $b_{12}/b_{11}$, $b_{12}/b_{22}$, ... so as to be able to
give at once the correction to any unknown due to a change in any
other.

Evidently the rate of convergence in both cases will depend on the
latter set of ratios. As an example consider a set of equations

$$\left.\begin{aligned}
x_1 &= 1 - kx_2 - kx_3,\\
x_2 &= -kx_1 - kx_3,\\
x_3 &= -kx_1 - kx_2.
\end{aligned}\right\} \tag{9}$$

The second method gives $(1, 0, 0)$ as the first approximation, $(1, -k, -k)$
as the second, $(1+2k^2,\ -k+k^2,\ -k+k^2)$ as the third, and so on. The
second approximation always decreases $W$, the third decreases it if
$-0.39 < k < 0.64$ but otherwise increases it.

Seidel's method, applied to the same set of equations, gives in turn

$$x_1 = 1, \quad x_2 = -k, \quad x_3 = -k+k^2,$$

$$x_1 = 1+2k^2-k^3, \quad x_2 = -k+k^2-3k^3+k^4.$$

The correct solution, to order $k^3$, is

$$x_1 = 1+2k^2-2k^3, \quad x_2 = x_3 = -k+k^2-3k^3. \tag{10}$$
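The two procedures can be compared on this example by a short sketch. It is only an illustration; the function names are hypothetical, and the exact solution used for comparison is obtained by solving the three equations directly, giving $x_1 = (1+k)/\{(1-k)(1+2k)\}$ and $x_2 = x_3 = -k/\{(1-k)(1+2k)\}$.

```python
def simultaneous_step(x, k):
    """Second method: every unknown is revised from the previous approximation."""
    x1, x2, x3 = x
    return (1 - k * x2 - k * x3, -k * x1 - k * x3, -k * x1 - k * x2)

def seidel_step(x, k):
    """Seidel's method: each revised value is used as soon as it is found."""
    x1, x2, x3 = x
    x1 = 1 - k * x2 - k * x3
    x2 = -k * x1 - k * x3
    x3 = -k * x1 - k * x2
    return (x1, x2, x3)

def exact_solution(k):
    """Direct solution of the three equations, for comparison."""
    denom = (1 - k) * (1 + 2 * k)
    return ((1 + k) / denom, -k / denom, -k / denom)

def iterate(step, k, sweeps):
    """Apply `step` repeatedly, starting from all unknowns zero."""
    x = (0.0, 0.0, 0.0)
    for _ in range(sweeps):
        x = step(x, k)
    return x
```

Three sweeps of `simultaneous_step` reproduce the third approximation quoted above, and two sweeps of `seidel_step` give $x_1 = 1+2k^2-k^3$. Starting from zero is a choice made here so that the first sweep yields the first approximation of the text.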

† Proc. Roy. Soc. A, 164, 1938, 447-67; Relaxation Methods in Engineering Science,
1940; Relaxation Methods in Theoretical Physics, 1946.

The chief usefulness of these methods is in the estimation of many
unknowns when some of them occur in only a small fraction of the
equations of condition. The method of Southwell and Black has been
applied, for instance, by the Ordnance Survey to problems where the
work is laid out in many stages.† Each point gives rise to equations of
condition connecting its position with those of the points observed
from it and those it is observed from. Any displacement of its adopted
position appears in no equation of condition for a point two stages away,
or more, and most of the coefficients in the normal equations are
therefore zero. Hence the points can be adjusted in turn, beginning with
those observed from the base-line. A modification of the second method
was used by Bullen and me in the construction of the times of the P
wave in seismology.‡ Here for each earthquake used there were three
special parameters, namely, the latitude and longitude of the epicentre
and the time of occurrence. The other parameters to be found were a
set of corrections to the trial table at such intervals that interpolation
would be possible. What was done was to use the trial tables to
determine the elements of each earthquake as if the tables were right. The
residuals were then classified by distance to give corrections to the
tables. The process was then repeated with the corrected tables as a
standard. No change was needed after the third approximation. One
advantage of these methods is that they are iterative and therefore
self-checking; another is that they break up the work into parts and
avoid the need to form and solve what would in this case have been
normal equations for about 150 unknowns. The difference from the
simple statements of the rules given above is that two or three
unknowns are adjusted at once instead of only one.

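An estimate of uncertainty can be obtained as follows. Remembering
that the standard error of $x_1$ is the standard error of one observation
multiplied by the square root of the leading diagonal element of the
inverse of the matrix of the normal equations, and that this element is
the value found for $x_1$ on putting 1 on the right of the normal equation
for $x_1$ and 0 in all the others, we need only make this substitution, solve
by iteration for each parameter in turn, and the standard errors follow
at once.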

† The Observatory, 62, 1939, 43. ‡ Bur. Centr. Séism., Trav. Sci., Fasc. 11, 1936.

4.21. Combination of estimates with different estimated
uncertainties. We have seen that when a set of observations is derived
from the normal law, but the standard error is estimated from the
residuals, its uncertainty makes the posterior probability of the true
value follow the $t$ rule instead of the normal law. The effect is fully
taken into account in the standard tables for the $t$ rule. But it often
happens that several series of observations yield independent estimates
of the same true value, the standard errors of one observation being
different in the different series. Can we still summarize the information
in any useful compact form? The exact solution is straightforward; it is

$$P(dx \mid \theta H) \propto \prod_r\left\{1+\frac{(x-x_r)^2}{\nu_r c_r^2}\right\}^{-\frac12(\nu_r+1)} dx, \tag{1}$$

where $x_r$, $\nu_r$, and $c_r$ are the mean, number of degrees of freedom, and
standard error of the mean of the $r$th set. This can be calculated exactly
for any set of estimates, but it is unlikely that the calculation would
often be undertaken. Clearly it is in general not reducible to a $t$ rule.

It would be useful if we could reduce (1) approximately to a $t$ rule.
We are mostly concerned with errors not more than a few times the
standard error of our estimate. Consequently it is better to try to fit
a $t$ rule for small errors than for large ones. We can proceed by equating
first, second, and fourth derivatives of the logarithms at the value of $x$
that makes the density in (1) a maximum. It is obviously useless to
equate third derivatives, because the $t$ rule is always symmetrical and
(1) need not be exactly so. We try therefore to choose $\bar x$, $c$, $\nu$ so that

$$\sum_r \tfrac12(\nu_r+1)\log\left\{1+\frac{(x-x_r)^2}{\nu_r c_r^2}\right\} - \tfrac12(\nu+1)\log\left\{1+\frac{(x-\bar x)^2}{\nu c^2}\right\} \tag{2}$$

has zero first, second, and fourth derivatives at $x = \bar x$. The conditions
are

$$\sum_r \frac{\nu_r+1}{\nu_r}\,\frac{\bar x - x_r}{c_r^2\,u_r(\bar x)} = 0, \tag{3}$$

$$\frac{\nu+1}{\nu c^2} = \sum_r \frac{\nu_r+1}{\nu_r c_r^2}\left\{\frac{2}{u_r^2(\bar x)} - \frac{1}{u_r(\bar x)}\right\}, \tag{4}$$

$$\frac{\nu+1}{\nu^2 c^4} = \sum_r \frac{\nu_r+1}{\nu_r^2 c_r^4}\left\{\frac{8}{u_r^4(\bar x)} - \frac{8}{u_r^3(\bar x)} + \frac{1}{u_r^2(\bar x)}\right\}, \tag{5}$$

where

$$u_r(x) = 1+\frac{(x-x_r)^2}{\nu_r c_r^2}. \tag{6}$$

These can be solved by successive approximation without much
difficulty. It may be noticed that for a single $t$ rule the expectation of
$1/u_r(x)$ is $\nu_r/(\nu_r+1)$ and that of the right side of (4) is $\sum(\nu_r+1)/\{(\nu_r+3)c_r^2\}$.
Hence in a first approximation we can weight the $x_r$ in accordance with
their unmodified standard errors, but $c^{-2}$ will be systematically less than
$\sum c_r^{-2}$. The approximation therefore corrects the underestimate of the
second moment made by using the normal law instead of the $t$ law for
the separate series. The solution allows series even with $\nu_r = 1$ to be
taken into account (cf. 3.4(13)). $\nu$ can be called the effective number
of degrees of freedom.
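A sketch of the successive approximation may help. It assumes that each series enters the product with the factor $\{1+(x-x_r)^2/\nu_r c_r^2\}^{-\frac12(\nu_r+1)}$, so that the condition on the first derivative is a weighted mean with weights $(\nu_r+1)/\{\nu_r c_r^2 u_r\}$, where $u_r = 1+(\bar x-x_r)^2/\nu_r c_r^2$. The conditions on the second and fourth derivatives then give $(\nu+1)/\nu c^2 = A$ and $(\nu+1)/\nu^2c^4 = B$, with $A = \sum(\nu_r+1)(2/u_r^2 - 1/u_r)/\nu_r c_r^2$ and $B = \sum(\nu_r+1)(8/u_r^4 - 8/u_r^3 + 1/u_r^2)/\nu_r^2 c_r^4$, whence $\nu + 1 = A^2/B$. The function name and the fixed number of iterations are choices made for this illustration.

```python
import math

def combine_t_estimates(means, dofs, std_errs, iterations=200):
    """Fit a single t rule (centre, c, nu) to a product of t rules.

    means, dofs, std_errs hold x_r, nu_r, c_r for the separate series.
    """
    def u(x, x_r, nu_r, c_r):
        return 1.0 + (x - x_r) ** 2 / (nu_r * c_r**2)

    # First approximation: weights from the unmodified standard errors.
    weights = [1.0 / c_r**2 for c_r in std_errs]
    centre = sum(w * x for w, x in zip(weights, means)) / sum(weights)

    # First-derivative condition, solved by repeated reweighting.
    for _ in range(iterations):
        weights = [(nu_r + 1) / (nu_r * c_r**2 * u(centre, x_r, nu_r, c_r))
                   for x_r, nu_r, c_r in zip(means, dofs, std_errs)]
        centre = sum(w * x for w, x in zip(weights, means)) / sum(weights)

    # Sums A and B from the second- and fourth-derivative conditions.
    a = b = 0.0
    for x_r, nu_r, c_r in zip(means, dofs, std_errs):
        ur = u(centre, x_r, nu_r, c_r)
        a += (nu_r + 1) / (nu_r * c_r**2) * (2 / ur**2 - 1 / ur)
        b += (nu_r + 1) / (nu_r**2 * c_r**4) * (8 / ur**4 - 8 / ur**3 + 1 / ur**2)

    if a <= 0 or b <= 0 or a * a / b <= 1:
        raise ValueError("no single t rule fits; the series may be discordant")
    nu = a * a / b - 1.0                   # effective degrees of freedom
    c = math.sqrt((nu + 1.0) / (nu * a))   # standard error of the combined estimate
    return centre, c, nu
```

A single series is returned unchanged, and two identical series with $\nu_r = 4$, $c_r = 1$ give $\nu = 9$ and $c = 2/3$, which is less precise than the value $1/\sqrt2$ that the normal law would give. The sketch does not test for more than one maximum of the product.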



In some cases (1) may have more than one maximum. Attempts to
combine the estimates are then undesirable.

4.3. The use of expectations. When a law of chance is such that 
sufficient statistics do not exist, it is often possible to proceed by con- 
sidering some function or functions of the observations. Given the 
parameters in the law, the expectations of these functions may be 
calculable in terms of the parameters. But the observations themselves 
yield the actual values of the functions for that set of observations. If 
the number of functions is also the number of parameters in the law, 
estimates of the parameters can be got by equating the theoretical and 
observed values. If the functions chosen are such that their expecta- 
tions are actually equal to the parameters they are called unbiased 
statistics by E. S. Pearson and J. Neyman. 

There are apparently an infinite number of unbiased statistics 
associated with any law. For we might choose any function of the 
observations, work out its expectation in terms of the law, and trans- 
form the law so as to introduce that expectation as a parameter in 
place of one of the original ones. A choice must therefore be made. 

If $\alpha$, $\beta$, $\gamma$ are parameters in a law, we can choose functions of a set of
$n$ possible observations $f(x_1,\dots,x_n)$, $g(x_1,\dots,x_n)$, $h(x_1,\dots,x_n)$ and work
out their expectations $F$, $G$, $H$, so that these will be functions of $\alpha$, $\beta$, $\gamma$
and will yield three equations for them when applied to an actual set of
observations. Actually, however, the observed values will differ
somewhat from the expectations corresponding to the correct values of the
parameters. The estimates of $\alpha$, $\beta$, $\gamma$ obtained will therefore be $a$, $b$, $c$,
which will differ a little from $\alpha$, $\beta$, $\gamma$. The choice is then made so that
all of $E(a-\alpha)^2$, $E(b-\beta)^2$, $E(c-\gamma)^2$ will be as small as possible.

It should be noticed that an expectation on a law is not necessarily
found best by evaluation of the corresponding function of the
observations. Suppose, for instance, that we have a set of observations derived
from the normal law about 0 and that for some reason we want the
expectation of $x^4$. This could be estimated as $\sum x^4/n$ from the actual
observations. Its theoretical value is $3\sigma^4$. But

$$E\left(\frac{\sum x^4}{n}-3\sigma^4\right)^2 = \frac{1}{n^2}E\left(\sum x^8\right) + \frac{1}{n^2}E\left(\sum x^4\right)E\left(\sum{}' x^4\right) - 9\sigma^8,$$

$\sum'$ meaning the sum over all values except the one taken to be $x$ in $\sum$
(all pairs occurring twice in the double summation); and this is

$$\frac{105\sigma^8}{n} + \frac{9(n-1)\sigma^8}{n} - 9\sigma^8 = \frac{96\sigma^8}{n}.$$

On the other hand, we find

$$E\left(\frac{\sum x^2}{n}\right)^2 = \left(1+\frac2n\right)\sigma^4, \qquad E\left(\frac{\sum x^2}{n}\right)^4 = \left(1+\frac2n\right)\left(1+\frac4n\right)\left(1+\frac6n\right)\sigma^8,$$

whence

$$E\left\{3\left(\frac{\sum x^2}{n}\right)^2 - 3\sigma^4\right\}^2 = \frac{72\sigma^8}{n} + O\left(\frac{1}{n^2}\right).$$

Thus three times the square of the mean square deviation is
systematically nearer the fourth moment of the law than the mean of the fourth
powers of the deviations is. We should be entitled to call $\sum x^4/n$ an
unbiased statistic for the fourth moment of the law; but it is not the
statistic that, given the parameters in the law, would be systematically
nearest to the true value. In this case $\sum x^2/n$ is a sufficient statistic,
and we have an instance of the rule that we shall get the best estimates
of any function of the parameters in the law by using the sufficient
statistics, where these exist.
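The comparison can be checked in exact arithmetic. The sketch below takes $\sigma = 1$ and uses only the moments of the normal law, $E x^4 = 3$, $E x^8 = 105$, together with $E(\sum x^2)^2 = n(n+2)$ and $E(\sum x^2)^4 = n(n+2)(n+4)(n+6)$. The function names are illustrative.

```python
from fractions import Fraction

def mse_mean_fourth_power(n):
    """E(sum x^4 / n - 3)^2 for n observations from the unit normal law."""
    n = Fraction(n)
    # n terms with E x^8 = 105, n(n-1) cross terms with (E x^4)^2 = 9.
    return (105 * n + 9 * n * (n - 1)) / n**2 - 9

def mse_from_mean_square(n):
    """E{3 (sum x^2 / n)^2 - 3}^2 for n observations from the unit normal law."""
    n = Fraction(n)
    second = 1 + 2 / n                                  # E (sum x^2 / n)^2
    fourth = (1 + 2 / n) * (1 + 4 / n) * (1 + 6 / n)    # E (sum x^2 / n)^4
    return 9 * (fourth - 2 * second + 1)
```

For every $n$ the first function gives exactly $96/n$; the second gives $72/n + 396/n^2 + 432/n^3$, so that the ratio of the second to the first tends to $72/96 = 3/4$ as $n$ increases.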

It may be asked why, seeing that the calculations are done on the
hypothesis that $\sigma$ is known, we should be interested in the probable
consequences of taking either $\sum x^2$ or $\sum x^4$ to derive an estimate of $\sigma$, seeing
that both estimates will be in error to some extent. In this case the
interest is not great. The practical problem is usually to estimate $\sigma$
from the observations, taking the observations as known and $\sigma$ as
initially unknown, and the set of observations is unique. Then we know
from the principle of inverse probability that the whole information
about $\sigma$ is summed up in $\sum x^2$ and we need consider no other function
of the observations; if we have $\sum x^2$ no other function will tell us anything
more about $\sigma$, if the normal law is true; if we have not $\sum x^2$, but have
some other function of the scatter of the observations, there must be
some loss of accuracy in estimating $\sigma$, since $\sum x^2$ is uniquely determined
by the observations but will not be uniquely determined by this other
function. Nevertheless occasions do arise where it is convenient to use,
to provide an estimate, some function of the observations that is not
a sufficient statistic. If sufficient statistics do not exist, the posterior
probability distribution for a parameter may be unobtainable without
a numerical integration with regard to the others, and this is often too
formidable an undertaking. Then it is worth while to consider some
set of statistics that can be conveniently found from the observations.

This involves some sacrifice of information and of accuracy, but we 
shall still want to know what precision can be claimed for the estimates 
obtained. This will involve finding the probability distribution for the 
statistics used, given the parameters in the law; and then the principle 
of inverse probability will still give the probability distribution of the 
parameters in the law, given these statistics. By considerations similar 
to those of 4.0 the effect of moderate variations in the prior probability 
is unimportant. We shall have lost some accuracy, but we shall still 
know how much we have kept. 

Fisher has introduced the convenient term 'efficiency', defined as
follows. Let $\sigma^2(\alpha)$ be the expectation of the square of the error of an
estimate, obtained by the method of maximum likelihood or inverse
probability, and let $\sigma'^2(\alpha)$ be the corresponding expectation found by
some other method. Then the efficiency of the second estimate is
defined to mean the limit of $\sigma^2(\alpha)/\sigma'^2(\alpha)$ when the number of
observations becomes large. In most cases both numerator and denominator
are of order $1/n$, and the ratio has a finite limit. For the normal law the
efficiency of the mean fourth power is $\tfrac34$. It may be said that such losses
of efficiency are tolerable; an efficiency of $\tfrac34$ means that the standard
error of the estimate is 1.15 times as large as the most accurate method
would give, and it is not often that this loss of accuracy will affect any
actual decision. Efficiencies below $\tfrac12$, however, may lead to serious loss.
If we consider what actually will happen, suppose that $\alpha$ is the true
value of a parameter, $a$ the estimate obtained by the most efficient
methods, and $a'$ that obtained by a less efficient one. Then

$$E(a-\alpha)^2 = \sigma^2(\alpha), \qquad E(a'-\alpha)^2 = \sigma'^2(\alpha).$$

But these quantities can differ only because $a'$ is not equal to $a$; and
if both $a$ and $a'$ are unbiased, so that

$$E(a-\alpha) = E(a'-\alpha) = 0,$$

we have

$$E(a'-a)^2 = \sigma'^2(\alpha) - \sigma^2(\alpha).$$

If $a'$ has an efficiency of 50 per cent., so that $\sigma'(\alpha) = \sqrt2\,\sigma(\alpha)$, $a'$ will
habitually differ from $a$ by more than the standard error of the latter.
This is very liable to be serious. No general rule can be given; we have
in particular cases to balance accuracy against the time that would be
needed for an accurate calculation, but as a rough guide it may be said
that efficiencies over 90 per cent. are practically always acceptable,
those between 70 and 90 per cent. usually acceptable, but those under
50 per cent. should be avoided.

The reason for using the expectation of the square of the error as 





the criterion is that, given a large number of observations, the proba- 
bility of a set of statistics given the parameters, and that of the 
parameters given the statistics, is usually distributed approximately on 
a normal correlation surface; for one parameter and one statistic this 
reduces to the normal law. The standard error appearing in this will 
be the expectation that we have considered. 

One important case where these considerations arise is that of 
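observations derived from an unknown law of error. Suppose that the
law is

$$P(dx \mid \sigma H) = f\left(\frac{x}{\sigma}\right)\frac{dx}{\sigma} \tag{1}$$

and that the origin is taken so that $E(x) = 0$. Let $E(x^r) = \mu_r$. We
know that the chance of the mean of $n$ observations is nearly normally
distributed about 0 with standard error $(\mu_2/n)^{1/2}$; $\mu_2$ is a determinate
function of $\sigma$. But in the inverse problem we have to find $\mu_2$ from
the observations, and this may be attempted as follows. Consider
$E\{\sum(x-\bar x)^2\}$ taken over $n$ observations. The probability distributions
of all the observations separately, given $\sigma H$, are independent, and

$$\sum(x-\bar x)^2 = \sum x^2 - 2\sum x\cdot\bar x + n\bar x^2 = \sum x^2 - n\bar x^2, \tag{2}$$

$$E\left\{\sum(x-\bar x)^2\right\} = (n-1)\mu_2. \tag{3}$$

Hence $\sum(x-\bar x)^2/(n-1)$ will be an unbiased estimate of $\mu_2$. It will not,
however, be the accurate value of $\mu_2$, and we proceed to consider its
expectation of error. We have

$$\begin{aligned}
E\left[\sum(x-\bar x)^2-(n-1)\mu_2\right]^2 &= E\left[\left\{\sum(x-\bar x)^2\right\}^2 - 2(n-1)\mu_2\sum(x-\bar x)^2 + (n-1)^2\mu_2^2\right]\\
&= E\left\{\sum(x-\bar x)^2\right\}^2 - (n-1)^2\mu_2^2\\
&= E\left(\sum x^2 - n\bar x^2\right)^2 - (n-1)^2\mu_2^2\\
&= E\left[\left(\sum x^2\right)^2 - 2n\bar x^2\sum x^2 + n^2\bar x^4\right] - (n-1)^2\mu_2^2.
\end{aligned} \tag{4}$$

Now

$$E\left(\sum x^2\right)^2 = E\sum x_1^4 + E\sum x_1^2\sum{}' x_2^2, \tag{5}$$

$\sum'$ denoting summation over all $x$'s except $x_1$; the 2 is taken into
account by the fact that each pair will appear twice. Hence

$$E\left(\sum x^2\right)^2 = n\mu_4 + n(n-1)\mu_2^2; \tag{6}$$

also

$$E\left(n\bar x^2\sum x^2\right) = \frac1n E\sum x_1^2\left(x_1+\sum{}' x_2\right)^2 = \mu_4 + 0 + (n-1)\mu_2^2, \tag{7}$$

$$E\left(n^2\bar x^4\right) = \frac{1}{n^2}E\left(\sum x\right)^4 = \frac1n\left\{\mu_4 + 3(n-1)\mu_2^2\right\} \tag{8}$$

(6 having been replaced by 3 to allow for the double summation).
Hence†

$$E\left[\sum(x-\bar x)^2-(n-1)\mu_2\right]^2 = \frac{(n-1)^2}{n}(\mu_4-\mu_2^2) + \frac{2(n-1)}{n}\mu_2^2. \tag{9}$$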

Thus the accuracy of the estimate of the second moment of the law
will depend on the fourth moment, that of the fourth on the eighth,
and so on. Apparently, therefore, we arrive at no result unless we have
the complete set of moments; but only $n$ independent ones can be found
from the observations, and for laws of Types IV and VII the higher
moments of the law do not exist. However, this is not so serious as it
seems. We are usually interested primarily in the mean and its
uncertainty, the latter being of order $n^{-1/2}$. But the uncertainty of $\mu_2$ is also
of order $n^{-1/2}$ if $\mu_4$ exists; and therefore will affect the uncertainty of
the mean by something of order $n^{-1}$. Quite a rough estimate of $\mu_4$ will
therefore be enough. We can get this by considering

$$E\sum(x-\bar x)^4 = E\left[\sum x^4 - 4\bar x\sum x^3 + 6\bar x^2\sum x^2 - 3n\bar x^4\right]. \tag{10}$$

Here

$$E\left(\bar x\sum x^3\right) = \frac1n E\sum x_1^3\left(x_1+\sum{}' x_2\right) = \mu_4, \tag{11}$$

and we find

$$E\sum(x-\bar x)^4 = \frac{(n-1)(n^2-3n+3)}{n^2}\mu_4 + \frac{3(n-1)(2n-3)}{n^2}\mu_2^2. \tag{12}$$

Given the law, the errors of $\bar x$ and $\sum(x-\bar x)^2$ are not necessarily
independent. We have

$$E\left[n\bar x\left\{\sum(x-\bar x)^2-(n-1)\mu_2\right\}\right] = E\left\{\left(\sum x\right)\left(\sum x^2-n\bar x^2\right)\right\} = E\left(\sum x^3\right) - \frac1n E\left(\sum x\right)^3 = (n-1)\mu_3, \tag{13}$$

$$E\left\{\sum(x-\bar x)^3\right\} = (n-1)\left(1-\frac2n\right)\mu_3. \tag{14}$$

There will therefore be a correlation between the errors of location and
scaling if the law is unsymmetrical. With such a law, if $\mu_3$ is positive,
there will be a strong concentration of chance at small negative values
of $x$ and a widely spread distribution over positive values. Thus a
negative error of the mean will tend to be associated with a small
scatter of the observations and a positive one with a large scatter.
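Expectations of this kind can be checked by direct enumeration for a small law. The sketch below uses a hypothetical unsymmetrical three-point law with $E(x) = 0$, chosen only for illustration, and compares exact expectations with the closed forms $E[\sum(x-\bar x)^2-(n-1)\mu_2]^2 = (n-1)^2(\mu_4-\mu_2^2)/n + 2(n-1)\mu_2^2/n$, $E[n\bar x\{\sum(x-\bar x)^2-(n-1)\mu_2\}] = (n-1)\mu_3$, and $E\sum(x-\bar x)^3 = (n-1)(1-2/n)\mu_3$.

```python
from fractions import Fraction
from itertools import product

# Hypothetical three-point law with E(x) = 0, used only as an illustration.
VALUES = [Fraction(-1), Fraction(0), Fraction(2)]
PROBS = [Fraction(2, 5), Fraction(2, 5), Fraction(1, 5)]

def moment(r):
    """mu_r = E(x^r) on the three-point law."""
    return sum(p * v**r for v, p in zip(VALUES, PROBS))

def expectation(statistic, n):
    """Exact expectation of statistic(sample) over all samples of size n."""
    total = Fraction(0)
    for indices in product(range(len(VALUES)), repeat=n):
        prob = Fraction(1)
        for i in indices:
            prob *= PROBS[i]
        total += prob * statistic([VALUES[i] for i in indices])
    return total

def central_sum(xs, power):
    """Sum of (x - mean)^power over the sample."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** power for x in xs)

def check(n):
    """Return (enumerated, closed form) pairs for the three expectations."""
    mu2, mu3, mu4 = moment(2), moment(3), moment(4)
    nf = Fraction(n)
    scale_error = lambda xs: central_sum(xs, 2) - (n - 1) * mu2
    return {
        "square of scale error": (
            expectation(lambda xs: scale_error(xs) ** 2, n),
            (nf - 1) ** 2 / nf * (mu4 - mu2**2) + 2 * (nf - 1) / nf * mu2**2,
        ),
        "location-scale product": (
            expectation(lambda xs: sum(xs) * scale_error(xs), n),
            (nf - 1) * mu3,
        ),
        "third central sum": (
            expectation(lambda xs: central_sum(xs, 3), n),
            (nf - 1) * (1 - 2 / nf) * mu3,
        ),
    }
```

Calling `check(3)` enumerates the 27 possible samples; in each pair the two entries agree exactly. The positive product term shows the association of positive errors of the mean with large scatter when $\mu_3$ is positive.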

† This can also be derived easily from Fisher, Proc. Lond. Math. Soc. 30, 1930, 206.

The higher moments in such a case furnish an example of what
Fisher calls ancillary statistics, which are not used to estimate the
parameters but to throw additional light on their precision. The number
of observations is always an ancillary statistic. $\bar x$ and $\{\sum(x-\bar x)^2/n(n-1)\}^{1/2}$
are unbiased statistics for the parameter of location and its standard
error, but they sacrifice some information contained in the observations
if the law is not normal. According as $\mu_4$ is more or less than $3\mu_2^2$ the
estimate of $\mu_2$ will be less or more accurate than a similar estimate
from the same number of observations given the normal law. In the
former case the posterior probability for the location parameter will
resemble a $t$ distribution with less than $n-1$ degrees of freedom, in the
latter one with more. If for reasons of convenience, then, we take as
our estimate $\bar x \pm \{\sum(x-\bar x)^2/n(n-1)\}^{1/2}$, as for the normal law, attention to $\mu_3$
and $\mu_4$ will recover some of the information concerning the distribution
of the chance of large errors.

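The correlation between $\bar x$ and $\sum(x-\bar x)^2$ is

$$\frac{E\left[n\bar x\left\{\sum(x-\bar x)^2-(n-1)\mu_2\right\}\right]}{\left[E(n\bar x)^2\,E\left\{\sum(x-\bar x)^2-(n-1)\mu_2\right\}^2\right]^{1/2}} = \frac{\mu_3}{\left[\mu_2\left\{\mu_4-\dfrac{n-3}{n-1}\mu_2^2\right\}\right]^{1/2}}, \tag{15}$$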

and if we write 

= {2(a;-x)2-(n-l)p,} = (n-l)p;, (16) 

we shall have 


^-3 A)’ 


= erf, 


P(dxdpj I aH) 






with considerable accuracy, and this may be used in place of the likeli- 
hood in assessing the posterior probabilities when the location and scale 
parameters are to be found from the observations. 

If $\mu_4$ is infinite, as for a Type VII law with index 2, the expression (9)
is infinite, and it appears that the estimate of $\mu_2$ will have an infinite
uncertainty. This does not prove, however, that the estimate is useless.
It means only that the chance of error in $\mu_2$ is so far from being normally
distributed that it has an infinite second moment. The law for it will
resemble the Cauchy distribution (index 1); though this has an infinite
second moment it is possible to find on it a deviation with the same
chance of being exceeded as for any given deviation on the normal law;
it does not represent infinite uncertainty. But what will be true is that
the chance of large errors in $\mu_2$ as estimated will fall off less rapidly
than it will for finite $\mu_4$ as $n$ increases.



The method of expectations sometimes fails completely. Karl
Pearson's procedure in fitting his laws was to find the mean of the observed
values, and the mean second, third, and fourth moments about the
mean. These would be equated to $Ex$, $E(x-Ex)^2$, $E(x-Ex)^3$, and
$E(x-Ex)^4$. This process gives four equations for the parameters in the
law, which can then be solved numerically. These moments are not in
general sufficient statistics, since the likelihood cannot be expressed in
terms of them except in a few special cases. The resulting inaccuracy
may be very great. For the Type VII law with index $m$, when
$m \le \tfrac52$ the expectation of the fourth moment is infinite. The
actual fourth moment of any set of observations is finite, and therefore
any set of observations derived from such a law would be interpreted
as implying $m > \tfrac52$. For some actual series of observational errors $m$ is
as small as this or nearly so. Pearson does not appear to have allowed
for finite $n$; he identified $\sum(x-\bar x)^2$ with $\sum x^2$, neglecting the error of $\bar x$.
This is usually trivial in practice. But Pearson's delight in heavy
arithmetic often enabled him to give results to six figures when the
third was in error for this reason and the second was uncertain with any
method of treatment. The method of minimum $\chi'^2$ should give greater
accuracy with little trouble; other approximate methods, approaching
the accuracy of the method of maximum likelihood at its best, are
available for Types II and VII, and for I and IV as long as the
asymmetry is not too great†; for Types III and V with known termini,
sufficient statistics exist. If the terminus is known to be at 0, the
arithmetic and geometric means are sufficient for Type III, the
geometric and harmonic means for Type V. For the rectangular law the
extreme observations are sufficient statistics in any case.

† Phil. Trans. A, 237, 1938, 231-71.

The property of the extreme observations for the rectangular law
can be somewhat generalized. For suppose that the lower terminus is at
$x = \alpha$, and that

$$P(x < x_1 \mid \alpha H) = A(x_1-\alpha)^r \tag{18}$$

for $x_1-\alpha$ small. Then the chance that $n$ observations will all be greater
than $x_1$ is $\{1-A(x_1-\alpha)^r\}^n$, the differential of which will be the chance
that the extreme observation will lie in a range $dx_1$. Taking the prior
probability of $\alpha$ uniform, we shall have

$$P(d\alpha \mid x_1 H) \propto (x_1-\alpha)^{r-1}\{1-A(x_1-\alpha)^r\}^{n-1}\,d\alpha \propto (x_1-\alpha)^{r-1}\exp\{-(n-1)A(x_1-\alpha)^r\}\,d\alpha \tag{19}$$

for large $n$. For $r = 1$, the rectangular law, this makes the expectation
of $x_1-\alpha$, given $x_1$, of order $1/n$; for $r < 1$, corresponding to U-shaped
and J-shaped distributions, the expectation falls off more rapidly than
$1/n$; even for $r = 2$, it still only falls off like $n^{-1/2}$. Thus even for laws
that cut the axis at a finite angle the extreme observation may contain
an amount of information about the terminus comparable with that
in the remainder; for other laws between this and the rectangular law,
and for all U-shaped and J-shaped distributions, the extreme
observation by itself may be used to provide an estimate of the terminus. This
remark, due originally to Fisher, shows the undesirability of grouping
the extreme observations in such cases. It may easily happen that the
grouping interval is more than the uncertainty derivable from the
extreme observation alone, and then grouping may multiply the
uncertainty attainable several times.
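The dependence on $n$ can be made explicit. Taking the large-$n$ form of the posterior density as proportional to $t^{r-1}\exp\{-(n-1)At^r\}$ with $t = x_1-\alpha \ge 0$, the substitution $s = (n-1)At^r$ gives the expectation $\Gamma(1+1/r)\,\{(n-1)A\}^{-1/r}$. The sketch below merely evaluates this expression; the function name and the default $A = 1$ are assumptions for illustration.

```python
import math

def expected_gap(n, r, amplitude=1.0):
    """Expectation of x1 - alpha for density t^(r-1) exp{-(n-1) A t^r}, t >= 0."""
    rate = (n - 1) * amplitude
    return math.gamma(1.0 + 1.0 / r) / rate ** (1.0 / r)
```

With $A = 1$, `expected_gap(101, 1)` is $0.01$, while increasing $n-1$ fourfold halves `expected_gap(n, 2)` and reduces `expected_gap(n, 0.5)` sixteenfold, in agreement with the orders $1/n$, $n^{-1/2}$, and $n^{-2}$.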

Sometimes a law would possess sufficient statistics if certain minor 
complications were absent. It is then often sufficiently accurate to 
find expectations of the contributions to these statistics made by the 
minor complications, and subtract them from the values given by the 
observations. The method of maximum likelihood can then be used. 
An example of a common type is given in 4.6. 

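4.31. Orthogonal parameters. It is sometimes convenient to
choose the parameters in a law so that the product terms in $g_{ik}$ of 4.0(4)
will have small coefficients. If the maximum likelihood estimates of the
parameters in a law $g(x,\alpha_i)$ are $a_i$, and if $\alpha_i - a_i = \alpha_i'$,

$$\log L = \sum\log g(x_r,\alpha_i) = \sum\log g(x_r,a_i) + \tfrac12\sum\alpha_i'\alpha_k'\frac{\partial^2}{\partial\alpha_i\,\partial\alpha_k}\log g + \dots, \tag{1}$$

the derivatives being evaluated at $\alpha_i = a_i$. Now the expectation of the
coefficient of $\alpha_i'\alpha_k'$ is

$$\frac n2\int g\,\frac{\partial^2\log g}{\partial\alpha_i\,\partial\alpha_k}\,dx = \frac n2\int\left(-\frac1g\frac{\partial g}{\partial\alpha_i}\frac{\partial g}{\partial\alpha_k} + \frac{\partial^2 g}{\partial\alpha_i\,\partial\alpha_k}\right)dx. \tag{2}$$

Since $\int g\,dx = 1$ for all $\alpha_i$, the second part of the integral is zero; hence

$$E\,\tfrac12\sum\frac{\partial^2}{\partial\alpha_i\,\partial\alpha_k}\log g = -\tfrac12 n g_{ik}, \tag{3}$$

where $g_{ik}$ is the same function as in 3.9. There is therefore a direct
relation between the expectation of the quadratic terms in $\log L$ and
the invariant forms $I_2$ and $J$ used in 3.9.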





Now if $g_{ik}\,d\alpha_i\,d\alpha_k$ is regarded as the square of an element of distance
in m dimensions, at any point it will be possible to choose in an infinity
of ways a set of m mutually orthogonal directions. We can then choose
orthogonal coordinates so that if

\[ g_{ik}\,d\alpha_i\,d\alpha_k = h_{jl}\,d\beta_j\,d\beta_l \tag{4} \]

all $h_{jl}$ vanish except for $j = l$. If the law $g(x,\alpha_i)$ is then expressed in
terms of the quantities $\beta_j$ instead of $\alpha_i$, the quadratic terms in $E(\log L)$
will reduce to a sum of squares, and for an actual set of observations
the square terms in log L will increase like n, while the product terms
will be of order $n^{1/2}$. Thus the equations to determine the $\beta_j$ will be
nearly orthogonal, and practical solution will be much simplified. The
product terms can be neglected for large n, since their neglect only
introduces errors of order $n^{-1}$.

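For instance, take a Type VII law in the form

\[ y = \frac{(m-1)!}{(m-\frac32)!\,\sqrt{2\pi M}\,\sigma}\Bigl\{1+\frac{(x-\lambda)^2}{2M\sigma^2}\Bigr\}^{-m}, \tag{5} \]

where M is a function of m. Evidently

\[ \int y\,\frac{\partial^2}{\partial\lambda\,\partial\sigma}\log y\,dx \quad\text{and}\quad \int y\,\frac{\partial^2}{\partial\lambda\,\partial m}\log y\,dx \]

are zero. The condition that $\int y\,\dfrac{\partial^2}{\partial m\,\partial\sigma}\log y\,dx$ shall vanish is found
to be

\[ \frac1M\frac{dM}{dm} = \frac{m+1}{m(m-\frac12)} = \frac{3}{m-\frac12}-\frac{2}{m}. \tag{6} \]

For y to tend to the normal form with standard error $\sigma$ when $m\to\infty$,
$M/m$ must tend to 1; we must therefore have

\[ M = \frac{(m-\frac12)^3}{m^2} \qquad (\tfrac12 < m < \infty), \tag{7} \]

so that

\[ y = \frac{m!}{\{2\pi(m-\frac12)\}^{1/2}\,(m-\frac12)!\,\sigma}\Bigl\{1+\frac{m^2(x-\lambda)^2}{2(m-\frac12)^3\sigma^2}\Bigr\}^{-m}. \tag{8} \]

With the law in this form we can form the maximum likelihood equations
for $\lambda$, $\sigma$, and m, neglecting non-diagonal terms, and approximation
is rapid, any error being squared at the next step.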
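
The orthogonality of $\sigma$ and m under the choice (7) can be checked numerically. The sketch below is an illustration only, not part of the original argument: it evaluates the expectation of the product of the first derivatives of log y with respect to m and $\sigma$ (which vanishes exactly when the expectation of the mixed second derivative does) by central differences and the trapezoidal rule, for the form (5) with $\lambda = 0$. The function names, the integration range, and the step sizes are assumptions chosen for the example.

```python
import math

def m_orthogonal(m):
    """M(m) = (m - 1/2)^3 / m^2, the choice in (7)."""
    return (m - 0.5) ** 3 / m ** 2

def log_density(x, sigma, m, scale_fn):
    """log y for the Type VII form (5) with lambda = 0 and M = scale_fn(m)."""
    big_m = scale_fn(m)
    # (m-1)!/(m-3/2)! written with log-gamma functions
    log_norm = (math.lgamma(m) - math.lgamma(m - 0.5)
                - 0.5 * math.log(2.0 * math.pi * big_m) - math.log(sigma))
    return log_norm - m * math.log1p(x * x / (2.0 * big_m * sigma ** 2))

def cross_term(m, sigma, scale_fn, half_width=80.0, step=0.02, eps=1e-4):
    """Trapezoidal estimate of E[(d log y / dm)(d log y / d sigma)]."""
    n_steps = int(round(2.0 * half_width / step))
    total = 0.0
    for i in range(n_steps + 1):
        x = -half_width + i * step
        d_m = (log_density(x, sigma, m + eps, scale_fn)
               - log_density(x, sigma, m - eps, scale_fn)) / (2.0 * eps)
        d_s = (log_density(x, sigma + eps, m, scale_fn)
               - log_density(x, sigma - eps, m, scale_fn)) / (2.0 * eps)
        weight = 0.5 if i in (0, n_steps) else 1.0
        total += weight * math.exp(log_density(x, sigma, m, scale_fn)) * d_m * d_s * step
    return total

# With M from (7) the cross term is negligible; with M fixed at 1 it is not.
print(cross_term(3.0, 1.0, m_orthogonal))
print(cross_term(3.0, 1.0, lambda m: 1.0))
```

With M held constant the cross term works out analytically to $-1/(m\sigma)$, so the second figure should be near $-\frac13$ for m = 3, $\sigma$ = 1.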

For Type II laws the corresponding form is

\[ y = \frac{(m-\frac12)!}{\{2\pi(m+\frac12)\}^{1/2}\,(m-1)!\,\sigma}\Bigl\{1-\frac{m^2(x-\lambda)^2}{2(m+\frac12)^3\sigma^2}\Bigr\}^{m} \qquad (1 < m < \infty). \tag{9} \]

For $m \le 1$, $dy/dx$ does not tend to 0 at the termini. It is then best to
take the termini explicitly as parameters.




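Specimen curves for $\lambda = 0$, $\sigma = 1$ are given in the diagram, Fig. 2.
The maximum likelihood equations for a Type VII law in this form are

\[ \frac{\partial}{\partial\lambda}\log L = \frac{m}{M\sigma^2}\sum\frac{x-\lambda}{1+(x-\lambda)^2/2M\sigma^2} = 0, \tag{10} \]

\[ \frac{\partial}{\partial\sigma}\log L = -\frac{n}{\sigma}+\frac{m}{M\sigma^3}\sum\frac{(x-\lambda)^2}{1+(x-\lambda)^2/2M\sigma^2} = 0, \tag{11} \]

\[ \frac{\partial}{\partial\mu}\log L = -nm^2\Bigl\{\frac{d}{dm}\log m!-\frac{d}{dm}\log(m-\tfrac12)!-\frac{1}{2m-1}\Bigr\} + m^2\sum\log\Bigl\{1+\frac{(x-\lambda)^2}{2M\sigma^2}\Bigr\} - \sum\frac{m^4(m+1)(x-\lambda)^2}{2(m-\frac12)^4\sigma^2\{1+(x-\lambda)^2/2M\sigma^2\}} = 0, \tag{12} \]

where $\mu = 1/m$.

[Fig. 2. Laws of Types II (m negative) and VII (m positive) for $\sigma = 1$.]

For Type II, where now $M = (m+\frac12)^3/m^2$, they are

\[ \frac{\partial}{\partial\lambda}\log L = \frac{m}{M\sigma^2}\sum\frac{x-\lambda}{1-(x-\lambda)^2/2M\sigma^2} = 0, \tag{13} \]

\[ \frac{\partial}{\partial\sigma}\log L = -\frac{n}{\sigma}+\frac{m}{M\sigma^3}\sum\frac{(x-\lambda)^2}{1-(x-\lambda)^2/2M\sigma^2} = 0, \tag{14} \]

\[ \frac{\partial}{\partial\mu}\log L = nm^2\Bigl\{\frac{d}{dm}\log(m-\tfrac12)!-\frac{d}{dm}\log(m-1)!-\frac{1}{2m+1}\Bigr\} + m^2\sum\log\Bigl\{1-\frac{(x-\lambda)^2}{2M\sigma^2}\Bigr\} + \sum\frac{m^4(m-1)(x-\lambda)^2}{2(m+\frac12)^4\sigma^2\{1-(x-\lambda)^2/2M\sigma^2\}} = 0, \tag{15} \]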





where $\mu = -1/m$. It is convenient to define $\mu$ as $+1/m$ for Type VII
and as $-1/m$ for Type II, since this provides for continuous passage
through the normal law by increasing $\mu$ through 0. Actual fitting would
be done as follows. First treat m as infinite and find first approximations
to $\lambda$ and $\sigma$ as for the normal law. Substitute in (12) or (15), for a
number of trial values of m. Interpolation gives a value of $\mu$, and the
divided differences give a value of $\partial^2\log L/\partial\mu^2$, which is $-1/s^2(\mu)$.
Return to (10) and (11), or (13) and (14), and derive estimates of $\lambda$ and
$\sigma$. If the changes are considerable, solve afresh the equation for m.
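
The cycle just described can be sketched for the Type VII case as follows. This is an illustration under stated assumptions rather than the procedure of the text: the law is taken as $y \propto \sigma^{-1}\{1+(x-\lambda)^2/2M\sigma^2\}^{-m}$ with $M = (m-\frac12)^3/m^2$, so that (10) and (11) read $\sum w(x-\lambda) = 0$ and $\sigma^2 = (m/Mn)\sum w(x-\lambda)^2$ with $w = 1/\{1+(x-\lambda)^2/2M\sigma^2\}$; these are solved by repeated substitution, and the trial values of m are compared directly through log L instead of by interpolation in (12). The function names and the data are hypothetical.

```python
import math

def big_m(m):
    return (m - 0.5) ** 3 / m ** 2

def log_likelihood(data, lam, sigma, m):
    """log L for the assumed Type VII form with M = (m - 1/2)^3 / m^2."""
    M = big_m(m)
    log_norm = (math.lgamma(m) - math.lgamma(m - 0.5)
                - 0.5 * math.log(2.0 * math.pi * M) - math.log(sigma))
    return sum(log_norm - m * math.log1p((x - lam) ** 2 / (2.0 * M * sigma ** 2))
               for x in data)

def fit_location_scale(data, m, tol=1e-12, max_iter=5000):
    """Solve (10) and (11) for lambda and sigma at fixed m by repeated substitution."""
    n = len(data)
    M = big_m(m)
    lam = sum(data) / n                       # first approximation: normal law
    sigma2 = sum((x - lam) ** 2 for x in data) / n
    for _ in range(max_iter):
        w = [1.0 / (1.0 + (x - lam) ** 2 / (2.0 * M * sigma2)) for x in data]
        new_lam = sum(wi * x for wi, x in zip(w, data)) / sum(w)
        new_sigma2 = m / (M * n) * sum(wi * (x - new_lam) ** 2 for wi, x in zip(w, data))
        done = abs(new_lam - lam) < tol and abs(new_sigma2 - sigma2) < tol
        lam, sigma2 = new_lam, new_sigma2
        if done:
            break
    return lam, math.sqrt(sigma2)

# Hypothetical observations with one outlying value.
data = [-3.1, -1.2, -0.4, 0.0, 0.3, 0.9, 1.5, 6.0]
for m in (1.5, 2.0, 3.0, 5.0, 10.0, 50.0):
    lam, sigma = fit_location_scale(data, m)
    print(m, round(lam, 4), round(sigma, 4), round(log_likelihood(data, lam, sigma, m), 4))
```

The trial value of m with the largest log L would then be refined, and $\lambda$ and $\sigma$ recomputed, as in the text.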

An approximate allowance for the effect of the uncertainty of $\sigma$ on
the posterior probability distribution of $\lambda$ can be found as follows. For
the normal law

\[ s^2(\sigma) = \sigma^2/2n. \]

The numerical solution here gives a value for $s^2(\sigma)$; we can define

\[ n' = \frac{\sigma^2}{2s^2(\sigma)} = \tfrac12\sigma^2\,\frac{\partial^2}{\partial\sigma^2}(-\log L), \]

and, since two parameters besides $\lambda$ have been estimated, we can take
the effective number of degrees of freedom as $n'-2$.

A table of $d\log m!/dm$ is given by E. Pairman at intervals of 0.02 up
to m = 20.t For m 10 it is given in the British Association Tables. 

4.4. If the law of error is unknown and the observations are too few to
determine it, we can use the median observation as a statistic for the
median of the law. We can then proceed as follows. Let $\alpha$ be the median
of the law; we want to find a range such that the probability, given the
observations, that $\alpha$ lies within it has a definite value. Let a be a possible
value of $\alpha$ such that l observations exceed a and $n-l$ fall short of it. Then

\[ P(l \mid a, n, H) = \binom{n}{l}(\tfrac12)^n = \Bigl(\frac{2}{\pi n}\Bigr)^{1/2}\exp\Bigl\{-\frac{2(l-\frac12 n)^2}{n}\Bigr\} \tag{1} \]

nearly; and if the prior probability of a is taken uniform,

\[ P(da \mid l, n, H) \propto \exp\Bigl\{-\frac{2(l-\frac12 n)^2}{n}\Bigr\}\,da. \tag{2} \]

Thus the posterior probability density of $\alpha$ is a maximum at the median,
and if we take $l = \frac12 n \pm \frac12\sqrt n$ as limits corresponding to the standard
error, the corresponding values of a will give a valid uncertainty, whatever
the law and the scale parameter. The limits will not in general
correspond to actual observations but can be filled in by interpolation.
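
The rule can be sketched as follows. This is a minimal illustration, assuming one particular interpolation convention that the text does not fix: a value a exceeded by l observations is placed at rank $n-l+\frac12$ among the ordered observations, with linear interpolation between neighbouring ranks, so that the limits $l = \frac12 n \pm \frac12\sqrt n$ fall at ranks $\frac12(n+1) \mp \frac12\sqrt n$. The function names are hypothetical.

```python
import math

def value_at_rank(sorted_data, rank):
    """Linear interpolation between order statistics; rank 1 is the smallest."""
    n = len(sorted_data)
    position = min(max(rank - 1.0, 0.0), n - 1.0)   # clamp to the observed range
    low = int(math.floor(position))
    high = min(low + 1, n - 1)
    frac = position - low
    return sorted_data[low] + frac * (sorted_data[high] - sorted_data[low])

def median_with_limits(data):
    """Return (lower limit, median, upper limit) from the rule l = n/2 +- sqrt(n)/2."""
    xs = sorted(data)
    n = len(xs)
    centre = 0.5 * (n + 1)
    half_width = 0.5 * math.sqrt(n)
    return (value_at_rank(xs, centre - half_width),
            value_at_rank(xs, centre),
            value_at_rank(xs, centre + half_width))

print(median_with_limits(range(1, 101)))
```

For the hundred values 1 to 100 the limits fall five ranks on either side of the median, whatever the law that generated the observations.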

The question of departure from the normal law is commonly 
† Tracts for Computers, No. 1.





considered in relation to the ‘rejection of observations’. Criteria for the 
latter have been given by Peirce and Chauvenet. They appear, how- 
ever, to be wrong in principle. If observations are legitimately rejected, 
the normal law does not hold, and these observations could be used to 
estimate a departure from it. There is no reason to suppose that the 
retained observations are themselves derived from the normal law, and, 
in fact, there is reason to suppose that they are not; and then the mean 
and standard error found from the observations retained may easily be 
invalid estimates. Another consideration is that if we make a definite
rule that observations within certain arbitrary limits are to be retained 
at full weight and all beyond them rejected, then the decision about 
a single outlying observation may easily affect the mean by its apparent
standard error, which is highly undesirable. Again it is often advocated 
that the uncertainty of the true value, as estimated from the mean, 
should be got from the average residual without regard to sign instead 
of the mean square residual, on the ground that the former is less 
affected by a few abnormally large residuals than the latter is. But if 
the mean of the observations is taken as the estimate of the mean of 
the law, its uncertainty is correctly estimated from $\mu_2$, if the latter
exists, and if it does not exist the uncertainty will not be proportional
to $n^{-1/2}$. For all laws such that $\mu_2$ exists the mean square residual gives
an unbiased estimate of $\mu_2$. The ratio of the expectation of the average
residual without regard to sign to $\sqrt{\mu_2}$, however, depends on the form
of the law of error. If the average residual is found and then adapted
to give an estimate of $\sqrt{\mu_2}$ by applying the factor found for the normal
law, this factor will be too small for laws of Type VII, which are pre- 
cisely those where the use of this method is recommended. The cases 
where this treatment is recommended are just those where it is most 
likely to lead to an underestimate of uncertainty. If the mean is taken 
as the estimate, there is no alternative to the mean square residual to 
provide an estimate of uncertainty when the law is in doubt. 
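
The dependence of this ratio on the form of the law can be made concrete. The sketch below is an illustration, not part of the original argument: it uses the fact that a Type VII law with index m has the shape of Student's t distribution with $\nu = 2m-1$ degrees of freedom, for which the ratio of the mean absolute deviation to $\sqrt{\mu_2}$ is $2\,\Gamma(\frac12(\nu+1))\sqrt{\nu-2}\,/\{\sqrt\pi\,(\nu-1)\,\Gamma(\frac12\nu)\}$ when $\nu > 2$. The function name is hypothetical.

```python
import math

def mean_abs_over_rms(m):
    """E|x - lambda| / sqrt(mu_2) for a Type VII law with index m > 3/2."""
    nu = 2.0 * m - 1.0   # equivalent Student t degrees of freedom
    log_ratio = (math.log(2.0) + math.lgamma(0.5 * (nu + 1.0)) + 0.5 * math.log(nu - 2.0)
                 - 0.5 * math.log(math.pi) - math.log(nu - 1.0) - math.lgamma(0.5 * nu))
    return math.exp(log_ratio)

normal_ratio = math.sqrt(2.0 / math.pi)   # value for the normal law
for m in (2.0, 3.0, 5.0, 20.0):
    print(m, round(mean_abs_over_rms(m), 4), round(normal_ratio, 4))
```

The ratio is below the normal value for every finite m, so dividing an average residual by the normal ratio gives too small an estimate of $\sqrt{\mu_2}$, which is the underestimate referred to above.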

On the other hand, it is only for the normal law that the mean is 
actually the best estimate, and for other laws we are entitled to con- 
sider other estimates that may be more efficient. One interesting case 
is the law 

\[ P(dx \mid m, a, H) = \tfrac12\exp\Bigl(-\frac{|x-m|}{a}\Bigr)\frac{dx}{a}. \]

Here we find easily that the likelihood is a maximum if m is taken
equal to the median observation, and if a is the average residual without
regard to sign. This law is therefore known as the median law. Given




any of the three properties the other two can be deduced. It is only 
subject to this law that the average residual leads to the best estimate 
of uncertainty, and then the best estimate of the location parameter 
is provided by the median observation and not by the mean. The 
interest of the law is reduced somewhat by the fact that there do not 
appear to be any cases where it is true. It has the property, however, 
that it lies higher on the tails and in the centre, and lower on the flanks, 
than the normal law with the same second moment, and these pro- 
perties are shared by the laws of Type VII. Fisher shows that for the 
Cauchy law the standard error of the median of n observations is $\pi/2\sqrt n$,
while that of the maximum likelihood solution is $\sqrt{(2/n)}$. Thus the
efficiency of the median as an estimate is $8/\pi^2 = 0.81$, which is quite
high, in spite of the fact that the expectation of the average residual
without regard to sign is infinite. For the normal law it is $2/\pi = 0.64$,
and it varies little in the intermediate range of Type VII. In the corre- 
sponding range the efficiency of the mean varies from 1 to 0. There is, 
therefore, much to be said for the use of the median as an estimate 
when the form of the law is unknown; it loses some accuracy in com- 
parison with the best methods, but the increase of the uncertainty is 
often unimportant, and varies little with the form of the law, and the 
uncertainty actually obtained is found easily by the rule (2). An 
extension to the fitting of equations of condition for several unknowns, 
however, would be rather complicated in practice. The maximum like- 
lihood for the median law comes at a set of values such that, for each 
unknown, the coefficient of that unknown and the residual have the 
same sign in half the equations of condition and opposite signs in the 
other half. To satisfy these relations would apparently involve more 
arithmetic than the method of least squares. The simplicity of the use 
of the median for one location parameter does not persist for several 
parameters, and the practical convenience of the method of least 
squares is a strong argument for its retention. 

4.41. The nature of the effect of the law of error on the appropriate
treatment is seen by considering a law

\[ P(dx \mid \alpha H) = f(x-\alpha)\,dx. \tag{1} \]

The maximum likelihood solution is given by

\[ \frac{f'(x_1-\alpha)}{f(x_1-\alpha)} + \dots + \frac{f'(x_n-\alpha)}{f(x_n-\alpha)} = 0. \tag{2} \]





If the arithmetic mean is the maximum likelihood solution for all possible
observed values, this is equivalent to

\[ 0 = (x_1-\alpha)+\dots+(x_n-\alpha), \tag{3} \]

whence

\[ f(x) = \frac{h}{\sqrt\pi}\exp(-h^2x^2), \tag{4} \]

the result obtained by Gauss. But if we put

\[ w = -\frac{f'(x-\alpha)}{(x-\alpha)\,f(x-\alpha)}, \tag{5} \]

(2) is equivalent to

\[ \sum w(x-\alpha) = 0. \tag{6} \]

Hence $\alpha$ is a weighted mean of the observed values. If $f'(x)/f(x)$ does
not increase as fast as the residual, the appropriate treatment will give
reduced weight to the large residuals. If it increases faster, they should
receive more weight than the smaller ones. The former consideration
applies to a Type VII law, for which, for large $x-\alpha$,
$f'(x-\alpha)/(x-\alpha)f(x-\alpha)$ behaves like $-(x-\alpha)^{-2}$ instead of being
constant. The latter applies to the rectangular law, for which w is zero
except at the ends of the range, where it is infinite.

These considerations suggest an appropriate treatment in cases where 
the distribution is apparently nearly normal in the centre, but falls 
off less rapidly at the extremes. This kind of distribution is shown 
especially by seismological observations. If two observers read ordinary 
records and agree about which phases to read, they will usually agree 
within 1 or 2 seconds. But the arrival of a new phase is generally 
superposed on a background which is already disturbed, and the ob- 
server has to decide which new onsets are distinct phases and which 
are merely parts of the background. The bulk of the observers actually 
usually agree, but there are scattered readings up to 10 or 20 seconds 
away from the main concentration. The following are specimens. The 
residuals are in seconds. The first series refer to P at good Pacific 
stations, the second to intermediate ones, the third to S at short 
distances in deep-focus earthquakes. 

Residual    -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3   4   5   6   7   8   9  10
Number (1)    0   1   1   1   1   1   4   8  13  14  13   8  10   2   4   1   1   2   2   0   1
Number (2)    0   1   2   0   1   2   2   2   7   8  10  10   4   3   3   2   4   1   2   0   2
Number (3)    ?   ?   6   4   7  10  16  23  31  61  59  44  39  22  15   8   8   7   8   ?   ?


The central groups alone may suggest a standard error of about 2 seconds, 
but the second moment of the whole of the observations might at the 
best suggest one of 4 or 5 seconds. At the worst it would become 




meaningless because there may be no definite gap between two distinct 
phases, and we have no rule so far for separating them. In such a case 
we may suppose that the law has the form 

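\[ P(dx \mid \alpha, h, H) = \Bigl[\frac{(1-m)h}{\sqrt\pi}\exp\{-h^2(x-\alpha)^2\}+m\,g(x-\beta)\Bigr]dx, \tag{7} \]

where mg is always small and g varies little within ranges of order 1/h.
Within this range we must regard g as an unknown function. Then

\[ \log L = \sum\log\Bigl[\frac{(1-m)h}{\sqrt\pi}\exp\{-h^2(x-\alpha)^2\}+m\,g(x-\beta)\Bigr], \tag{8} \]

\[ \frac1L\frac{\partial L}{\partial\alpha} = \sum\frac{\{(1-m)h/\sqrt\pi\}\,2h^2(x-\alpha)\exp\{-h^2(x-\alpha)^2\}}{\{(1-m)h/\sqrt\pi\}\exp\{-h^2(x-\alpha)^2\}+m\,g(x-\beta)}, \tag{9} \]

\[ \frac1L\frac{\partial L}{\partial h} = \sum\frac{\{(1-m)/\sqrt\pi\}\{1-2h^2(x-\alpha)^2\}\exp\{-h^2(x-\alpha)^2\}}{\{(1-m)h/\sqrt\pi\}\exp\{-h^2(x-\alpha)^2\}+m\,g(x-\beta)}, \tag{10} \]

or, if we write

\[ w^{-1} = 1+\frac{m\sqrt\pi}{(1-m)h}\,g(x-\beta)\exp\{h^2(x-\alpha)^2\}, \tag{11} \]

\[ \frac1L\frac{\partial L}{\partial\alpha} = \sum 2h^2\,w\,(x-\alpha), \tag{12} \]

\[ \frac1L\frac{\partial L}{\partial h} = \sum w\Bigl\{\frac1h-2h(x-\alpha)^2\Bigr\}. \tag{13} \]

Thus, with the appropriate weights, the equations for $\alpha$ and h reduce
to the usual ones. To find these weights requires an estimation of g,
which need only be rough. We note that the density at large residuals
is mg, and at small ones $(1-m)h/\sqrt\pi+mg$; thus the coefficient of the
exponential in (11) is the ratio of the density at large values to the
excess at the mode, which is estimated immediately from the frequencies.
If we denote this coefficient by $\mu$, we have

\[ w^{-1} = 1+\mu\exp\{h^2(x-\alpha)^2\}, \tag{14} \]

and apart from $\mu$, g is irrelevant to $\alpha$ and h. Also, in $\partial^2\log L/\partial\alpha^2$, the
term in $\partial w/\partial\alpha$ is small on account of a factor $\mu(x-\alpha)^2$ when $x-\alpha$ is small,
and of a factor $\mu^{-1}\exp\{-h^2(x-\alpha)^2\}$ when $x-\alpha$ is large; in any case we
can neglect it and take

\[ \alpha = \frac{\sum wx}{\sum w}\pm\frac{\sigma}{\sqrt{(\sum w)}}, \tag{15} \]

where $\sigma$ is given by

\[ \sigma^2\sum w = \sum w(x-\alpha)^2. \tag{16} \]

The method has been applied extensively in seismology with satisfactory
results. A change in $\mu$ or $\alpha$ necessitates a change of the weights,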





and it is usually necessary to proceed by successive approximation, but 
more often than not the second approximation almost repeats the first. 
As a starting-point we can find h roughly from the distributions of the 
frequencies near the centre, compute from it the expected frequencies 
according to the normal law, and use the excess on the flanks to estimate
$\mu$. Alternatively, if there is a range of approximately constant
frequencies on each side, we can subtract their mean from all frequen- 
cies, including those in the central group, replace negative values by 0, 
and compute a from the remainders. This has been called the method of 
uniform reduction. The chief use has been in finding corrections to trial 
tables. The residuals for all ranges together give a good determination 
of the weights, which are then applied to the separate ranges to give 
the required corrections. With this method the weight is a continuous 
function of the residual, and the difficulty about a hard and fast limit 
for rejection does not arise. 
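
The successive approximation can be sketched for the first specimen series tabulated above. This is an illustration under stated assumptions: the weights are taken as $w^{-1} = 1+\mu\exp\{h^2(x-\alpha)^2\}$ with $h^2 = 1/2\sigma^2$, the estimate as the weighted mean with $\sigma^2\sum w = \sum w(x-\alpha)^2$, and the counts are read from series (1) with the residuals running from -10 to 10 in unit steps. The value $\mu = 0.08$ (about one reading per second on the flanks against an excess of about 13 at the mode) and the starting value $\sigma = 2$ are rough readings of the frequencies made for this example only; the function name is hypothetical.

```python
import math

residuals = list(range(-10, 11))
# Series (1): P at good Pacific stations, as read from the table.
counts = [0, 1, 1, 1, 1, 1, 4, 8, 13, 14, 13, 8, 10, 2, 4, 1, 1, 2, 2, 0, 1]

def weighted_fit(residuals, counts, mu, sigma_start, iterations=50):
    """Iterate weights, weighted mean and sigma; returns (alpha, sigma, standard error)."""
    alpha, sigma = 0.0, sigma_start
    sum_w = float(sum(counts))
    for _ in range(iterations):
        h2 = 1.0 / (2.0 * sigma ** 2)
        w = [1.0 / (1.0 + mu * math.exp(h2 * (x - alpha) ** 2)) for x in residuals]
        sum_w = sum(c * wi for c, wi in zip(counts, w))
        alpha = sum(c * wi * x for c, wi, x in zip(counts, w, residuals)) / sum_w
        sigma = math.sqrt(sum(c * wi * (x - alpha) ** 2
                              for c, wi, x in zip(counts, w, residuals)) / sum_w)
    return alpha, sigma, sigma / math.sqrt(sum_w)

alpha, sigma, std_err = weighted_fit(residuals, counts, mu=0.08, sigma_start=2.0)
print(round(alpha, 3), round(sigma, 3), round(std_err, 3))
```

Readings far from the centre receive weights that fall smoothly to zero, so no single outlying residual is either kept at full weight or rejected outright.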

4.42. In the usual statement of the problem of least squares the
whole of the uncertainty is supposed concentrated in one of the variables
observed, the others being taken as not subject to error. This is a
common state of affairs, but not a universal one. It may happen that
we have a set of pairs (x, y), which may be taken as estimates of two
variables $(\xi, \eta)$ on different occasions, with a linear relation between
them, and that the uncertainties of each determination of x and y are
known and independent. The problem is to find the relation between
$\xi$ and $\eta$. Write

\[ \eta = \alpha\xi+\beta. \tag{1} \]

Then a typical observation must be read as

\[ P(dx_r\,dy_r\,d\xi_r \mid \alpha,\beta,H) \propto \frac{1}{\sigma_r\tau_r}\exp\Bigl\{-\frac{(x_r-\xi_r)^2}{2\sigma_r^2}-\frac{(y_r-\alpha\xi_r-\beta)^2}{2\tau_r^2}\Bigr\}\,dx_r\,dy_r\,d\xi_r \tag{2} \]

and

\[ \log L = \text{constant}-\tfrac12\sum\Bigl\{\frac{(x_r-\xi_r)^2}{\sigma_r^2}+\frac{(y_r-\alpha\xi_r-\beta)^2}{\tau_r^2}\Bigr\}, \tag{3} \]

the unknowns being the various $\xi_r$, $\alpha$, and $\beta$. Integrating with regard
to all the $\xi_r$, we get, with a uniform prior probability for $\alpha$ and $\beta$,

\[ P(d\alpha\,d\beta \mid \theta H) \propto \prod(\tau_r^2+\alpha^2\sigma_r^2)^{-1/2}\exp\Bigl\{-\tfrac12\sum\frac{(y_r-\alpha x_r-\beta)^2}{\tau_r^2+\alpha^2\sigma_r^2}\Bigr\}\,d\alpha\,d\beta. \tag{4} \]

Hence we can write

\[ y_r = \alpha x_r+\beta\pm(\tau_r^2+\alpha^2\sigma_r^2)^{1/2} \tag{5} \]

as a set of equations of condition to determine $\alpha$ and $\beta$. Since the
standard error involves $\alpha$ the solution must be by successive approximation,
but if the variation of $x_r$ and $y_r$ is much more than that of $\sigma_r$ and
$\tau_r$, a first approximation using equal weights will give a good estimate
of $\alpha$ and the second approximation will need little change. The result
is equivalent to using $x_r$ as the correct value of $\xi_r$ but using (1) and $\sigma_r$,
with an approximate $\alpha$, to estimate the uncertainty of $\eta_r$ at $x_r$.
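
The successive approximation can be sketched as follows. This is an illustration under stated assumptions: each equation of condition $y_r = \alpha x_r+\beta$ is given weight $1/(\tau_r^2+\alpha^2\sigma_r^2)$, where $\sigma_r$ and $\tau_r$ denote the known standard errors of $x_r$ and $y_r$; the first pass uses equal weights and later passes recompute the weights from the current $\alpha$. The function names and the data are hypothetical.

```python
def weighted_line(x, y, w):
    """Weighted least squares for y = alpha * x + beta."""
    sw = sum(w)
    sx = sum(wi * xi for wi, xi in zip(w, x))
    sy = sum(wi * yi for wi, yi in zip(w, y))
    sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = sw * sxx - sx * sx
    alpha = (sw * sxy - sx * sy) / det
    beta = (sxx * sy - sx * sxy) / det
    return alpha, beta

def fit_line_both_errors(x, y, sigma, tau, iterations=10):
    """Equal weights first, then weights 1 / (tau^2 + alpha^2 sigma^2)."""
    alpha, beta = weighted_line(x, y, [1.0] * len(x))
    for _ in range(iterations):
        w = [1.0 / (t * t + alpha * alpha * s * s) for s, t in zip(sigma, tau)]
        alpha, beta = weighted_line(x, y, w)
    return alpha, beta

x = [0.1, 0.9, 2.2, 2.9, 4.1]
y = [1.0, 3.1, 5.2, 6.8, 9.1]
print(fit_line_both_errors(x, y, sigma=[0.1, 0.2, 0.1, 0.3, 0.2], tau=[0.2, 0.1, 0.3, 0.1, 0.2]))
```

When the spread of the $x_r$ is large compared with the $\sigma_r$, the second pass changes $\alpha$ very little, as stated above.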

4.43. Grouping. Suppose that we have observations $x_r$ of a quantity,
for n different values of an argument t, and that we regard these as
representing a linear function of t, say $\alpha+\beta t$; the standard error of each
observation is $\sigma$. Then a typical equation of condition will be

\[ x_r = \alpha+\beta t_r\pm\sigma \tag{1} \]

and the normal equations for $\alpha$ and $\beta$ will be

\[ n\alpha+\beta\sum t_r = \sum x_r, \tag{2} \]

\[ \alpha\sum t_r+\beta\sum t_r^2 = \sum t_r x_r, \tag{3} \]

whence the standard error of $\beta$ is $\{\sum t_r^2-(\sum t_r)^2/n\}^{-1/2}\sigma$. If $\bar t$ is the mean
of the $t_r$ the standard error of $\alpha+\beta\bar t$ is $\sigma/\sqrt n$, and these uncertainties
are independent. This is the most accurate procedure.

On the other hand, we may proceed by taking the means of ranges
of observations near the beginning and the end; the difference will then
yield a determination of $\beta$. If there are m in each of these ranges and
the means are $(\bar t_1,\bar x_1)$, $(\bar t_2,\bar x_2)$, we have

\[ \bar x_1 = \alpha+\beta\bar t_1\pm\sigma/\sqrt m, \qquad \bar x_2 = \alpha+\beta\bar t_2\pm\sigma/\sqrt m, \tag{4} \]

whence

\[ \beta = \frac{\bar x_2-\bar x_1}{\bar t_2-\bar t_1}\pm\frac{\sigma\sqrt{(2/m)}}{\bar t_2-\bar t_1}. \tag{5} \]

Let us compare the uncertainties on the hypothesis that the observations
are uniformly spaced from $t = -1$ to $+1$. Then $\sum t^2$ will be
nearly $\frac13 n$ and the least squares solution has standard error $\sigma\sqrt{(3/n)}$.
Also $\bar t_2-\bar t_1 = 2(1-m/n)$ and the solution by grouping has standard
error $\sigma/\{(2m)^{1/2}(1-m/n)\}$. The latter is a minimum if $m = \frac13 n$, and
then is equal to $\sigma(27/8n)^{1/2}$. The efficiency of the solution by grouping,
as far as $\beta$ is concerned, is therefore $\frac89$, which for most purposes
would be quite satisfactory.† The expectation of the square of the
difference between the two estimates would correspond to a standard
error about one-third of that of the better estimate. If we took $m = \frac12 n$ we should
get a standard error of $2\sigma/n^{1/2}$, and the efficiency would be $\frac34$.
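
The comparison can be tabulated for any size of end group. A minimal sketch, assuming the two standard errors just given, $\sigma\sqrt{(3/n)}$ for least squares and $\sigma/\{(2m)^{1/2}(1-m/n)\}$ for grouping, so that the efficiency as a function of the fraction $f = m/n$ is $6f(1-f)^2$; the function name is hypothetical.

```python
def grouping_efficiency(fraction):
    """Ratio of the least squares variance of beta to the grouped variance, f = m/n."""
    return 6.0 * fraction * (1.0 - fraction) ** 2

# Scan end-group fractions in steps of 0.01 and pick the most efficient.
best_fraction = max((k / 100.0 for k in range(1, 100)), key=grouping_efficiency)
print(best_fraction, grouping_efficiency(best_fraction))
print(grouping_efficiency(1.0 / 3.0), grouping_efficiency(0.5))
```

The efficiency is flat near its maximum, so taking roughly a third of the observations in each end group is sufficient.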

The best estimate of $\alpha+\beta\bar t$ is the mean observation, and it is of
no importance whether we average the observations all together or
average the means of the three ranges. Hence we shall sacrifice


† The result is due to Sir Arthur Eddington, but he did not publish it.





hardly any accuracy if we divide the observations into ranges each
containing a third of the observations, determine $\beta$ by comparison of
the first and third, and $\alpha+\beta\bar t$ from the mean of all three with equal
weight.

Again, suppose that t ranges from 0 to $2\pi$, and that we want to
determine $\alpha+\beta\cos t$ from the observations of x. The normal equations
are

\[ n\alpha+\beta\sum\cos t_r = \sum x_r, \tag{6} \]

\[ \alpha\sum\cos t_r+\beta\sum\cos^2 t_r = \sum x_r\cos t_r. \tag{7} \]

If the arguments are equally spaced we shall have $\sigma^2(\alpha) = \sigma^2/n$,
$\sigma^2(\beta) = 2\sigma^2/n$.

But we may compare means by ranges about 0 and $\pi$. The sum
of the observations between 0 and $p\pi$ and between $(2-p)\pi$ and $2\pi$
will give, nearly,

\[ np\alpha+\frac{\beta n}{2\pi}\int_{-p\pi}^{p\pi}\cos t\,dt = n\bar x_1\pm\sigma\sqrt{(np)} \tag{8} \]

and the corresponding equation for the opposite range follows. Hence
$\beta$ can be estimated from

\[ \frac{2\beta}{\pi}\sin p\pi = \bar x_1-\bar x_2\pm\sigma\sqrt{(2p/n)} \tag{9} \]

and will be found most accurately if $p^{1/2}\operatorname{cosec}p\pi$ is a minimum. This
leads to $p\pi = 66^\circ\,47'$. The convenient value $p\pi = 60^\circ$ gives

\[ \sigma^2(\beta) = \frac{2\pi^2\sigma^2}{9n}, \tag{10} \]

and the efficiency is $9/\pi^2 = 0.91$. If we take $p = \frac12$, thus comparing
whole semicircles, we get an efficiency of $8/\pi^2 = 0.81$. The use of
opposite ranges of 120°, while giving high efficiency, also has the
merit that any Fourier term whose argument is a multiple of two or
three times that of the term sought will contribute nothing to the
estimate. If we used ranges of 180°, a term in 3t would contribute to
the estimate of $\beta$, but this term contributes nothing to the mean in a
120° range.
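
These figures can be reproduced directly. A minimal sketch, assuming that the grouped estimate has variance $\sigma^2(2p/n)(\pi/2\sin p\pi)^2$, which follows from (9), against $2\sigma^2/n$ for least squares, so that the efficiency is $4\sin^2 p\pi/(p\pi^2)$; the best half-range is the root of $\tan\theta = 2\theta$ with $\theta = p\pi$, found here by bisection. The function names and the bracketing interval are choices made for the example.

```python
import math

def cosine_grouping_efficiency(p):
    """Efficiency of the opposite-range estimate of beta; each range has half-width p*pi."""
    return 4.0 * math.sin(p * math.pi) ** 2 / (p * math.pi ** 2)

def best_half_range_degrees():
    """Minimise p^(1/2) cosec(p*pi): solve tan(theta) = 2*theta on (1.0, 1.5) radians."""
    low, high = 1.0, 1.5
    for _ in range(60):
        mid = 0.5 * (low + high)
        if math.tan(mid) - 2.0 * mid < 0.0:
            low = mid
        else:
            high = mid
    return math.degrees(0.5 * (low + high))

print(best_half_range_degrees())
print(cosine_grouping_efficiency(1.0 / 3.0), cosine_grouping_efficiency(0.5))
```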

Thus drastic grouping, if done in the best way, loses little in the 
accuracy of the estimates. The corresponding analysis for frequencies 
instead of measures leads to the same results.† There may, however,
be serious loss when the chance considered falls off rapidly towards the 
tails. I found this in discussing errors of observation; the sacrifice of 
† Proc. Roy. Soc. A, 164, 1938, 311-14.





the information about the distribution of the errors in ranges where 
the expectations according to the normal law were small led to the 
standard errors being increased several times. 

The method is particularly useful in carrying out harmonic analysis.
When the data are measures, if we use opposite ranges of 120°, the
coefficient of a sine or cosine is given by

\[ \beta = \frac{\pi}{\sqrt3}(\bar x_1-\bar x_2)\pm\frac{\pi\sigma\sqrt2}{3\sqrt n} = 1.814(\bar x_1-\bar x_2)\pm1.481\,\sigma/\sqrt n. \tag{11} \]

Where the problem is to estimate a Fourier term in a chance, if $n_1$ and
$n_2$ are the numbers of observations in opposite ranges of 120°, we get

\[ \beta = 1.814\,\frac{n_1-n_2}{n}\pm\frac{1.481}{\sqrt n}. \tag{12} \]

The similarity of the coefficients corresponds to the result in the minimum
$\chi'^2$ approximation that we can enter an observed number in an
equation of condition as $n_r\pm\sqrt{n_r}$.

4.44. Effects of grouping: Sheppard's corrections. In some
cases it is desirable to make allowance for less drastic grouping than
in 4.43. Suppose, as in 3.41, that the true value is x and the standard
error $\sigma$, and that we take a convenient arbitrary point of reference $x_0$.
Then all observations between $x_0+(r\pm\frac12)h$ will be entered as $x_0+rh$, and
our data are the numbers of observations so centred. As before, we take

\[ P(dx\,d\sigma \mid H) \propto dx\,d\sigma/\sigma, \tag{1} \]

but the chance of an observation being given as $x_0+rh$ is now

\[ P(r \mid x,\sigma,H) = \frac{1}{\sqrt{(2\pi)}\,\sigma}\int_{x_0+(r-\frac12)h}^{x_0+(r+\frac12)h}\exp\Bigl\{-\frac{(\xi-x)^2}{2\sigma^2}\Bigr\}\,d\xi. \tag{2} \]

Two cases arise according as h is large or small compared with a. In 
the former case the chance is negligible except for the range that in- 
cludes X. Hence if we find nearly the whole of the observations in a 
single range we shall infer that a is small compared with h. The likeli- 
hood is nearly constant for values of x in this range, and we shall be 
left with a nearly uniform distribution of the posterior probability of 
X within the interval that includes the observations, no matter how 
many observations we have. This is an unsatisfactory result; the 
remedy is to use a smaller interval of grouping. 





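If h is small with regard to $\sigma$, and if we put

\[ \xi-x_0-rh = \eta, \tag{3} \]

\[ \int_{-\frac12h}^{\frac12h}\exp\Bigl[-\frac{1}{2\sigma^2}\{(x_0+rh-x)^2+2\eta(x_0+rh-x)+\eta^2\}\Bigr]d\eta \]

\[ = \exp\Bigl[-\frac{1}{2\sigma^2}(x_0+rh-x)^2\Bigr]\int_{-\frac12h}^{\frac12h}\Bigl[1-\frac{\eta}{\sigma^2}(x_0+rh-x)+\frac{\eta^2}{2\sigma^4}(x_0+rh-x)^2-\frac{\eta^2}{2\sigma^2}\Bigr]d\eta \]

\[ = h\exp\Bigl[-\frac{1}{2\sigma^2}(x_0+rh-x)^2\Bigr]\Bigl[1+\frac{h^2}{24\sigma^4}\{(x_0+rh-x)^2-\sigma^2\}\Bigr] \tag{4} \]

to order $h^3$; and we shall have for the joint probability of the observations
given x and $\sigma$,

\[ P(\theta \mid x,\sigma,H) \propto \sigma^{-n}\exp\Bigl[-\frac{1}{2\sigma^2}\sum(x_0+rh-x)^2+\frac{h^2}{24\sigma^4}\Bigl\{\sum(x_0+rh-x)^2-n\sigma^2\Bigr\}\Bigr] \]

\[ = \sigma^{-n}\exp\Bigl[-\frac{n}{2\sigma^2}\{(\bar x-x)^2+s^2\}+\frac{nh^2}{24\sigma^4}\{(\bar x-x)^2+s^2-\sigma^2\}\Bigr], \tag{5} \]

where $\bar x$ and $s^2$ are a mean and a mean square residual found from the
recorded values. To this accuracy they are still sufficient statistics.
Hence

\[ P(dx\,d\sigma \mid \theta H) \propto \sigma^{-n-1}\exp\Bigl[-\frac{n}{2\sigma^2}\{(\bar x-x)^2+s^2\}+\frac{nh^2}{24\sigma^4}\{(\bar x-x)^2+s^2-\sigma^2\}\Bigr]dx\,d\sigma. \tag{6} \]

Differentiating (5) or (6) we see that the maximum for x is at $\bar x$, and
that for $\sigma$ is at

\[ -\frac{n}{\sigma}\Bigl(1-\frac{h^2}{12\sigma^2}\Bigr)+\frac{ns^2}{\sigma^3}\Bigl(1-\frac{h^2}{6\sigma^2}\Bigr) = 0, \qquad \sigma^2 = s^2-\tfrac1{12}h^2+O(n^{-1}). \tag{7} \]

The coefficient of $(\bar x-x)^2$ in (6) is therefore, to this order,

\[ -\frac{n}{2\sigma^2}\Bigl(1-\frac{h^2}{12\sigma^2}\Bigr) = -\frac{n}{2(\sigma^2+\frac1{12}h^2)} = -\frac{n}{2s^2}. \tag{8} \]

Without the allowance for finite h the corresponding values would be
$s^2$ and $-n/2s^2$. Hence (1) the uncertainty of x can be taken from the mean
square residual as it stands, and needs no correction; (2) to estimate $\sigma^2$
we should reduce $s^2$ by $h^2/12$.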
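
The size of the correction can be checked without sampling. The sketch below is an illustration, not part of the original argument: it computes exactly, from the normal integral over each grouping interval, the mean and variance of a normally distributed observation recorded to the nearest $x_0+rh$, and compares the variance with $\sigma^2+h^2/12$. The function names, the reference point, and the number of intervals summed are choices made for the example.

```python
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def grouped_moments(x_true, sigma, h, x0=0.0, span=60):
    """Mean and variance of a normal observation recorded to the nearest x0 + r*h."""
    mean = 0.0
    second = 0.0
    for r in range(-span, span + 1):
        centre = x0 + r * h
        # chance that the observation falls in the interval centred at x0 + r*h
        p = (normal_cdf((centre + 0.5 * h - x_true) / sigma)
             - normal_cdf((centre - 0.5 * h - x_true) / sigma))
        mean += p * centre
        second += p * centre ** 2
    return mean, second - mean ** 2

mean, variance = grouped_moments(x_true=0.3, sigma=1.0, h=1.0)
print(mean, variance, 1.0 + 1.0 / 12.0)
```

Even with h as large as $\sigma$ the grouped variance exceeds $\sigma^2$ by very nearly $h^2/12$, while the mean is hardly displaced.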

The latter correction is due to W. F. Sheppard.† He proceeded by
considering the expectation of the contribution to $s^2$, given $\sigma$, due to


† Proc. Lond. Math. Soc. 29, 1898, 368.




the finite h, and obtained the correction in this sense for any law of 
error. He showed also that the contribution to the third moment is 
zero, and to the fourth jlgA*, which should therefore be subtracted 

from the mean fourth moment of the observations before finding that 
of the law. It is in this form that the corrections have been most used. 
But the above argument brings out the point, also made by Fisher, that 
the uncertainty of the true value, given the observations, is determined 
by the uncorrected second moment and not by the corrected one. It is 
only when, as in computing a correlation from grouped data, we are 
directly interested in $\sigma$, that there is any point in applying the correc-
tion. There will be a slight departure from the rule of 3.41 in the 
posterior probability distribution of x, but this is negligible. 

4.45. There is a similar complication when the standard error consists
of two parts, one of which may be supposed known and equal to
$\sigma'$, while the other is to be found. There are two plausible assessments
of the prior probability. We may take $\sigma$ to be the complete standard
error, but restricted now to be greater than $\sigma'$; then the rule would be

\[ P(d\sigma \mid H) \propto d\sigma/\sigma, \tag{1} \]

for $\sigma > \sigma'$. On the other hand, we might take this rule to apply to only
the unknown portion $(\sigma^2-\sigma'^2)^{1/2}$; then

\[ P(d\sigma \mid H) \propto d\log(\sigma^2-\sigma'^2)^{1/2} \propto \frac{\sigma\,d\sigma}{\sigma^2-\sigma'^2}. \tag{2} \]

But the latter leads to an absurd result. For the likelihood is still
proportional to

\[ \sigma^{-n}\exp\Bigl[-\frac{n}{2\sigma^2}\{(x-\bar x)^2+s^2\}\Bigr] \tag{3} \]

and (2) will lead to a pole in the posterior probability at $\sigma = \sigma'$. Thus
the inference using this assessment of the prior probability would be
that $\sigma = \sigma'$, even though the maximum likelihood will be at a larger
value of $\sigma$; (1) on the other hand leads to the usual rule except for a
negligible effect of truncation.

The situation seems to be that in a case where there is a known 
contribution to the standard error it is not legitimate to treat the rest 
of the standard error as unknown, because the known part is relevant 
to the unknown part. The above allowance for grouping is a case in 
point, since we see that it is only when h is small compared with $\sigma$ that
n observations are better than one; if the interval was found too large it
would in practice be taken smaller in order that this condition should
be satisfied. The case that attracted my attention to the problem was 




that of observations of gravity, where repetition of observations at the 
same place shows that the accuracy of observation is of the order of 3 
milligals (1 milligal = 0.001 cm./sec.²), but there are differences between
neighbouring places of the order of 20 to 50 milligals. In combining 
the data to obtain a representative formula the latter must be treated 
as random variation, to which the inaccuracy of observation contributes 
only a small known part. The use of (2) would then say that we shall 
never dispose of the possibility that the whole of the variation is due 
to the observational error ; whereas it is already disposed of by the com- 
parison of observations in different places with the differences between 
observations repeated at the same place. This is a case of intraclass 
correlation (see later, 5.6); we must break up the whole variation into 
a part between stations and a part between observations at the same 
station, and when the existence of the former is established the standard 
error is found from the scatter of the station means, the differences 
between observations at the same station having little more to say. 
Thus the proper procedure is to use (1) or else to treat the standard 
error as a whole as unknown, it does not matter which. 

4.5. Smoothing of observed data. It often happens that we have 
a series of observed data for different values of the argument and with 
known standard errors, and that we wish to remove the errors as far 
as possible before interpolation. In many cases we already know the 
form of the function to be found, and we have only to determine the 
most probable values of the parameters in this function. The best 
method is then the method of least squares. But there are cases where 
no definite form of the function is suggested. Even in these the presence 
of errors in the data is expected. The tendency of random error is 
always to increase the irregularities, and part of any irregularity can 
therefore be attributed to random error, and we are entitled to try to 
reduce it. Such a process is called smoothing. Now it often happens in 
such cases that most of the third, or even the second or first differences, 
at the actual tabular intervals, are no larger than the known uncertainty 
of the individual values will explain, but that the values at wider in- 
tervals show these differences to be systematic. Thus if we have values 
at unit intervals of the argument over a range of 40, and we take 
differences at intervals 10, any systematic second difference will be 100 
times as large as for unit intervals, the random error remaining the 
same. The situation will be, then, that the values at unit intervals give 
no useful determination of the second derivative of the function, but 




this information can be provided by using wider intervals. On the other 
hand we want our solution to be as accurate as possible, and isolated 
values will not achieve this; thus the observed values from argument 15
to 25 will all have something to say about the true value at 20, and we
need to arrange our work so as to determine this as closely as we can. 

In such a case we may find that the values over a range of 10 are 
enough to determine a linear function by least squares, but that the 
coefficient of a square term is comparable with its standard error. If 
we reject the information about the curvature provided by a range of 
10, we lose little; and in any case comparison with adjacent ranges will 
give a much better determination. This suggests that in a range of 10 
we may simply fit a linear function. But if we do this there will be dis- 
continuities wherever the ranges abut, and we do not want to introduce 
new spurious discontinuities. We notice, however, that a linear func- 
tion is uniquely determined by two values. If then we use the linear 
solution to find values for two points in each range we can interpolate 
through all ranges and retain all the information about the curvature 
that can be got by comparison of widely separated values; while the
result for these two values will be considerably more accurate than for
the original ones. Such values may be called summary values.

Now the two values of the independent variable may be chosen 
arbitrarily, in an infinite number of ways consistent with the same 
linear equation. The question is, which of these is the best ? We have 
two considerations to guide us. The computed values will still have 
errors, of two types : ( 1 ) Even if the function sought was genuinely linear, 
any pair of values found from the observed ones would have errors. 
If we take the values of the argument too close together, these errors 
will tend to be equal; if they are too far apart they will tend to have
opposite signs on account of the error of the estimated gradient. There 
will be a set of pairs of values such that the errors are independent. 
But any interpolated value is a linear function of the basic ones. If 
we choose one of these pairs, the uncertainty of any interpolated value 
can be got by the usual rule for compounding uncertainties, provided 
that these are independent. If they are not, allowance must be made 
for the correlation, and this makes the estimation of uncertainty much
more difficult. (2) We are neglecting the curvature in any one range,
not asserting it to be zero. At some points in the range the difference
between the linear solution and the quadratic solution, both by least
squares, will be positive, at others negative. If we choose summary 
values at places where the two solutions agree, they are independent 



200 APPROXIMATE METHODS AND SIMPLIFICATIONS Chap. IV

of the curvature and therefore of its uncertainty; and this will not hold 
of any others. Neglect of the curvature will therefore do least harm if 
we use these values. We have therefore, apparently, three conditions to 
be satisfied by the values chosen for the argument: they must be such 
that the uncertainties of the estimated values of the function at them 
are independent, and such that neither is affected by the curvature. 
There are only two quantities to satisfy these conditions, but it turns 
out that they can always be found.

Let $x$ be the independent variable, $y$ the dependent one. Suppose
that the summary values are to be at $x_1$ and $x_2$, where $y$ takes the
values $y_1$ and $y_2$. Then the general quadratic expression that takes
these values is

$$y = y_1\frac{x-x_2}{x_1-x_2} + y_2\frac{x-x_1}{x_2-x_1} + A(x-x_1)(x-x_2), \tag{1}$$

in which $y_1$, $y_2$, and $A$ can be found by least squares. The weight of
the equation of condition for a particular $x$ being $w$, the normal equation
for $y_1$ is

$$\frac{\sum w(x-x_2)^2\,y_1}{(x_1-x_2)^2} - \frac{\sum w(x-x_1)(x-x_2)\,y_2}{(x_1-x_2)^2} + \frac{\sum w(x-x_1)(x-x_2)^2 A}{x_1-x_2} = \frac{\sum w(x-x_2)\,y}{x_1-x_2}. \tag{2}$$

The conditions that the uncertainties of $y_1$, $y_2$, and $A$ shall be
independent are therefore

$$\sum w(x-x_1)(x-x_2) = 0, \tag{3}$$

$$\sum w(x-x_1)(x-x_2)^2 = 0, \tag{4}$$

$$\sum w(x-x_1)^2(x-x_2) = 0. \tag{5}$$

But if we subtract (5) from (4) and cancel a factor $x_1-x_2$ from all
terms we obtain (3). Hence we have only two independent equations
to determine $x_1$ and $x_2$, and the problem has a solution.

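Put now

$$\sum w = n, \quad \sum wx = n\bar{x}, \quad x-\bar{x} = \xi, \quad \sum w\xi^2 = n\mu_2, \quad \sum w\xi^3 = n\mu_3. \tag{6}$$

Then (3) becomes

$$0 = \sum w(\xi-\xi_1)(\xi-\xi_2) = n(\mu_2+\xi_1\xi_2), \tag{7}$$

since $\sum w\xi = 0$. Also either of (4) or (5) with this gives

$$\mu_3 - \mu_2(\xi_1+\xi_2) = 0. \tag{8}$$

Hence $\xi_1$ and $\xi_2$ are the roots of

$$\xi^2 - \frac{\mu_3}{\mu_2}\,\xi - \mu_2 = 0, \tag{9}$$

and this is the solution required.

The sum of the weights of $y_1$ and $y_2$ is easily seen to be $n$. For

$$\sum w(x-x_1)^2 + \sum w(x-x_2)^2 = 2n\mu_2 + n(\xi_1^2+\xi_2^2) = 4n\mu_2 + n\mu_3^2/\mu_2^2, \tag{10}$$

$$(x_1-x_2)^2 = (\xi_1+\xi_2)^2 - 4\xi_1\xi_2 = \mu_3^2/\mu_2^2 + 4\mu_2, \tag{11}$$

and the sum of the weights is the ratio of these two expressions, as we
see from the first term in (2). This gives a useful check on the arithmetic.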
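
The choice of summary points can be illustrated numerically. The following is a minimal sketch in Python, not part of the original text; the function names `summary_points`, `summary_values`, and `weight_check` are hypothetical, and the data in the usage note are invented for illustration. It solves the quadratic (9) for the two arguments, evaluates the weighted straight-line fit there, and forms the ratio of (10) to (11), which should reproduce the sum of the weights.

```python
import math

def summary_points(x, w):
    """Arguments x1 < x2 for the summary values: roots of (9), shifted by the weighted mean."""
    n = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / n
    mu2 = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x)) / n
    mu3 = sum(wi * (xi - xbar) ** 3 for wi, xi in zip(w, x)) / n
    half = mu3 / (2.0 * mu2)               # half the sum of the roots
    root = math.sqrt(half * half + mu2)    # the product of the roots is -mu2
    return xbar + half - root, xbar + half + root

def summary_values(x, y, w):
    """Weighted least-squares straight line evaluated at the two summary points."""
    n = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / n
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / n
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    x1, x2 = summary_points(x, w)
    return (x1, ybar + b * (x1 - xbar)), (x2, ybar + b * (x2 - xbar))

def weight_check(x, w):
    """Ratio of expression (10) to expression (11); should equal sum(w)."""
    x1, x2 = summary_points(x, w)
    num = sum(wi * ((xi - x1) ** 2 + (xi - x2) ** 2) for wi, xi in zip(w, x))
    return num / (x1 - x2) ** 2
```

For example, with the invented arguments `x = [0, 1, 2, 3, 10]` and weights `w = [1, 2, 1, 1, 1]`, `weight_check(x, w)` should return the total weight 6 to within rounding, which is the arithmetic check mentioned in the text.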

In practice it is not necessary to use the exact values of $x_1$ and $x_2$.
Approximations to them will suffice to make the correlation between
the errors negligible, and the curvature, in any case small in the type
of problem considered, will make a negligible contribution. The most
convenient method of solution will usually be to solve by fitting a
linear function as usual and to find $y_1$ and $y_2$ and their uncertainties
by the usual method. If desired we can use $\chi^2$ to test the fit at other
values, and if there is a clear departure from the linear form we may
either estimate a curvature term or use shorter intervals. The latter
course is the more convenient, since the curvature if genuine can be
found more accurately later by comparing different ranges.

In practice it is convenient to begin by referring all values of $x$ to
an arbitrary zero near the middle of the range. Then the normal equations
to find a linear form

$$y = a + bx \tag{12}$$

are

$$na + b\sum wx = \sum wy, \tag{13}$$

$$a\sum wx + b\sum wx^2 = \sum wxy, \tag{14}$$

and the second, after eliminating $a$, gives

$$b\Big\{\sum wx^2 - \Big(\sum wx\Big)^2\Big/n\Big\} = \sum wxy - \bar{x}\sum wy. \tag{15}$$

The coefficient of $b$ is

$$\sum w(\xi+\bar{x})^2 - n\bar{x}^2 = \sum w\xi^2 = n\mu_2, \tag{16}$$

so that $\mu_2$ is found by simple division in the ordinary course of a least
squares solution. If we write

$$\sum wx^3 = n\lambda_3, \tag{17}$$

we have

$$n\lambda_3 = \sum w(\xi+\bar{x})^3 = n\mu_3 + 3n\mu_2\bar{x} + n\bar{x}^3, \tag{18}$$

and therefore

$$\mu_3 = \lambda_3 - 3\mu_2\bar{x} - \bar{x}^3. \tag{19}$$

The solution is easy and, even if the function is capable of being re- 
presented by a polynomial, nearly the whole of the original information 
is preserved in the summary values. These will not in general be equally 
spaced, but interpolation can then be done by divided differences.†
The method has been extensively used in seismology, where the original 
intervals and weights were usually unequal. With this method this 
introduces no difficulty. One feature was found here that may have 
further application. The curvature terms are rather large, but the higher 
ones small. For both P and S waves the times of transmission were 
found to be fitted from about 20° to 90° by quadratics, within about 
1/160 of the whole range of variation, though inspection of the small 
residuals against them showed that these were systematic. Convenient 
quadratics were therefore subtracted from the observed times, and 
linear forms were fitted to the departures from these for the separate
ranges. Summary values were found at distances rounded to the
nearest multiple of 0.5°, and added to the quadratics at these distances,
and finally the whole was interpolated to 1°. There was no previous
reason why quadratics should give so good a fit, but the fact that they
did made further smoothing easier.‡

The choice of ranges to summarize is mainly a matter of convenience. 
The only condition of importance is that they must not be long enough 
for a cubic term to become appreciable within them, since its values at 
$x_1$ and $x_2$ will not in general vanish. This can be tested afterwards by
comparing the divided differences of the summary values with their 
uncertainties. If the third differences are found significant it may be 
worth while to use shorter ranges; if not, we may get greater accuracy 
by taking longer ones. 

A solution has been found for the problem of finding three summary 
values from a quadratic determined by least squares, such that their 
uncertainties are independent of one another and their values unaffected
by a possible cubic term.§ It has not, however, been found so far to 
give enough improvement to compensate for the increased complication 
in the arithmetic. 

4.6. Correction of a correlation coefficient. In a common class of 

† Whittaker and Robinson, Calculus of Observations, ch. ii; H. and B. S. Jeffreys,
Methods of Mathematical Physics, 237-41.

‡ M.N.R.A.S. Geophys. Suppl. 4, 1937, 172-9, 239-40.

§ Proc. Camb. Phil. Soc. 33, 1937, 444-60.




problem the observations as actually recorded are affected by errors 
that affect the two variables independently, and whose general magni- 
tude is known from other sources. They may be errors of observation, 
and it is a legitimate question to ask what the correlation would be if 
the observations were made more accurate. The observations may have 
been grouped, and we may ask what the correlation would be if the 
original data were available. We represent these additional sources of 
error by standard errors $\sigma_0$, $\tau_0$, and continue to use $\sigma$ and $\tau$ for the ideal
observations of which the available ones are somewhat imperfect
modifications. But now the expectations of $x^2$, $y^2$, and $xy$ will be
$\sigma^2+\sigma_0^2$, $\tau^2+\tau_0^2$, $\rho\sigma\tau$, since the contributions of the additional error to $x$ and $y$
are independent. A normal correlation surface corresponding to these
expectations will still represent the conditions of observation if the
additional error is continuous. If it is due to grouping we can still use
it as a convenient approximation. But for this surface the proper scale
parameters and correlation coefficient will be

$$\sigma' = (\sigma^2+\sigma_0^2)^{1/2}, \quad \tau' = (\tau^2+\tau_0^2)^{1/2}, \quad \rho' = \rho\sigma\tau/\sigma'\tau'. \tag{1}$$

Now we have seen for one unknown that the best treatment of a known 
component of the standard error is to continue to use the $d\sigma/\sigma$ rule for
the prior probability of the whole standard error, merely truncating it 
so as to exclude values less than the known component. Consequently 
the analysis for the estimation of the correlation coefficient stands with 
the substitution of accented letters as far as 3.8 (10). Thus 

If then $\sigma_0$ and $\tau_0$ are small compared with $\sigma$ and $\tau$, it will be possible,
within the range of probable values of the parameters, to take the prior
probabilities of $\rho$ and $\rho'$ proportional; and then we can apply the $(z, \zeta)$
transformation to $r$ and $\rho'$ as before. The result may be written

$$\zeta' = z - \frac{5r}{2n} \pm \frac{1}{\sqrt{(n-1)}}, \tag{3}$$

from which the probability distribution of $\rho'$ follows at once. To derive
that of $\rho$ we must multiply all values of $\rho'$ by the estimate of $\sigma'\tau'/\sigma\tau$,
which will be

$$\frac{st}{\{(s^2-\sigma_0^2)(t^2-\tau_0^2)\}^{1/2}}.$$


The procedure is thus simply to multiply the correlation and its 
uncertainty, found as for the standard case, by the product of the 
ratios of the uncorrected and corrected standard errors in the two 




variables. Where the additional variation is due to grouping, this is the 
product of the ratios without and with Sheppard’s corrections. 

This device for correcting a correlation coefficient has been derived 
otherwise from consideration of expectations; but there is a complica- 
tion when the correlation is high, since it is sometimes found that the 
‘corrected’ correlation exceeds 1. This means that the random variation 
has given an r somewhat greater than p', which is already high, and if 
the usual correction is applied we are led to an impossible result. The 
solution is in fact simple, for the only change needed is to remember 
that the prior probability of p is truncated at ±1. We have therefore 
only to truncate the posterior probability at $\rho = \pm 1$ also. If the corrected estimate exceeds 1,
the probability density will be greatest at $\rho = 1$.

Such treatment is valid for one estimation, but when many have to 
be combined there is a complication analogous to that for negative 
parallaxes in astronomy (cf. p. 142). The data must always be combined 
before truncation. To truncate first and then take a mean would lead
to systematic underestimates of correlation. 
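
The procedure of this section can be sketched in a few lines of Python. This is an illustration only, not part of the original text: the function names are hypothetical, the numbers in the usage note are invented, and `s`, `t` stand for the observed standard deviations while `sigma0`, `tau0` stand for the known additional standard errors.

```python
import math

def correction_factor(s, t, sigma0, tau0):
    """Product of the ratios of uncorrected to corrected standard errors."""
    return (s * t) / math.sqrt((s ** 2 - sigma0 ** 2) * (t ** 2 - tau0 ** 2))

def corrected_correlation(r, se_r, s, t, sigma0, tau0):
    """Multiply the correlation and its uncertainty by the same factor."""
    f = correction_factor(s, t, sigma0, tau0)
    return r * f, se_r * f

def truncate(estimate):
    """Apply the limits -1 and +1 that the prior probability imposes."""
    return max(-1.0, min(1.0, estimate))

def combine_then_truncate(estimates):
    """Take the mean of the corrected estimates first and truncate afterwards."""
    return truncate(sum(estimates) / len(estimates))
```

With the invented corrected estimates 1.09 and 0.94, `combine_then_truncate` gives 1.0, whereas truncating each value first and then averaging gives 0.97, which shows the direction of the systematic underestimate described above. A simple unweighted mean is assumed here purely for brevity.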

4.7. Rank correlation. This method, introduced by Spearman and 
modified by Pearson, is extensively used in problems where a set of 
individuals are compared in respect of two properties, which either are 
not measurable or whose measures do not follow the normal law even 
roughly. The chief applications are in psychology, where there are few 
definite standards of measurement, but it is possible to arrange indivi- 
duals in orders with respect to two or more abilities. Then the orders 
can be compared without further reference to whether the abilities have 
received any quantitative measure at all, or if they have, whether this 
measure follows a normal law of chance. It is clear that if one ability 
is a monotonic function of the other, no matter how the measures may 
be made, the orders will either be the same or exactly opposite, so that
the amount of correspondence between the orders will indicate the 
relation, if any, between the abilities. Spearman’s proposal, then, was 
to assign numbers 1 to n to the observed individuals in respect of each 
ability, and then to consider the differences between their placings. If 
$x$ and $y$ are the placings of the same individual, the coefficient $R$ was
defined† by

$$R = 1 - \frac{3\sum|x-y|}{n^2-1}. \tag{1}$$


This coefficient has a peculiarity. If the orders are the same, we have
$\sum|x-y| = 0$, and $R = 1$. But if they are opposite we have, for four
members,

      x      y    |x-y|
      1      4      3
      2      3      1
      3      2      1
      4      1      3
                   ---
                    8

and

$$R = 1 - \frac{3\times 8}{15} = -0.6.$$

† Brit. Journ. Psych. 2, 1906, 89-108.

Thus complete reversal of the order does not simply reverse the sign 
of R. This formula has been largely superseded by another procedure 
also mentioned by Spearman, namely that we should simply work out 
the correlation coefficient between the placings as they stand. The 
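mean being $\tfrac12(n+1)$ in each case, this will be

$$r = \frac{\sum\{x-\tfrac12(n+1)\}\{y-\tfrac12(n+1)\}}{\sum\{x-\tfrac12(n+1)\}^2}, \tag{2}$$

which can also be written

$$r = 1 - \frac{6\sum(x-y)^2}{n^3-n}. \tag{3}$$

This expression is known as the rank correlation coefficient. It is $+1$
if the orders are the same and $-1$ if they are opposite.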

The formula needs some modification where some individuals in 
either series are placed equal. A formula for the correction is given by 
‘Student’† but it is possibly as easy to work out r directly, giving the
tied members the mean number that they would have if the tie were 
separated. 
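
The two coefficients and the treatment of ties can be checked with a short Python sketch. It is offered only as an illustration; the function names are hypothetical and do not come from the original text. `footrule_R` follows the first formula, `rank_r` the squared-difference form, `midranks` gives tied members the mean of the placings they would occupy if separated, and `product_moment` works out the correlation between placings directly.

```python
import math

def footrule_R(x, y):
    """Spearman's first coefficient, based on sum |x - y|."""
    n = len(x)
    return 1 - 3 * sum(abs(a - b) for a, b in zip(x, y)) / (n ** 2 - 1)

def rank_r(x, y):
    """Rank correlation for untied placings, based on sum (x - y)^2."""
    n = len(x)
    return 1 - 6 * sum((a - b) ** 2 for a, b in zip(x, y)) / (n ** 3 - n)

def midranks(values):
    """Placings 1..n, tied members receiving the mean of their placings."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1        # zero-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def product_moment(x, y):
    """Ordinary correlation coefficient, usable directly on midranks."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

For the reversed orders 1, 2, 3, 4 and 4, 3, 2, 1, `footrule_R` gives the value found in the text and `rank_r` gives exactly the opposite of its value for identical orders. Where ties occur, `product_moment(midranks(a), midranks(b))` is the direct calculation suggested above.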

The rank correlation, while certainly useful in practice, is difficult to 
interpret. It is an estimate, but what is it an estimate of? That is, it 
is calculated from the observations, but a function of the observations 
has no relevance beyond the observations unless it is an estimate of a 
parameter in some law. Now what can this law be? For r = 1 and 
r = — 1 the answer is easy; the law is that each ability is a monotonic 
function of the other. If the abilities are independent, again, the 
expectation of r is 0, and if r is found 0 in an investigation it will natur- 
ally be interpreted as an indication of independence. But for inter- 
mediate values of r the interpretation is not clear. The form (2) itself 
is the one derived for normal correlation; but the normal correlation 


† Biometrika, 13, 1921, 263-82.





surface has a maximum in the centre and an infinite range of possible 
values in all directions. In a given experiment any combination of 
these might occur. But x and y have a finite range of possible values, 
each of which they can take once and only once. The validity of the 
form (2) in relation to x and y therefore needs further examination.
r may be an estimate of some parameter in a law, but it is not clear
what this law can be, and whether r will be the best estimate for the 
parameter. 

To illustrate the point, suppose that a pair of large samples from 
different classes have been compared. A pair of smaller samples is taken 
from them at random. What is the probability distribution of r for 
the comparison of these small samples ? Except for some extreme cases, 
nobody knows; but we should want to know whether it depends only 
on the value of r for the comparison of the large classes, or whether it 
depends also on finer features of the relative distribution. In the latter 
case, if we had only the small samples, r found from them will not be 
a sufficient statistic for r in the large samples. 

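Pearson† has investigated the relation of $r$ to normal correlation. If
we consider the two laws

$$P(dx\,dy \mid \sigma_1,\sigma_2,H) = \frac{1}{2\pi\sigma_1\sigma_2}\exp\left(-\frac{x^2}{2\sigma_1^2}-\frac{y^2}{2\sigma_2^2}\right)dx\,dy, \tag{4}$$

$$P(dx\,dy \mid \sigma_1,\sigma_2,\rho,H) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{(1-\rho^2)}}\exp\left\{-\frac{1}{2(1-\rho^2)}\left(\frac{x^2}{\sigma_1^2}-\frac{2\rho xy}{\sigma_1\sigma_2}+\frac{y^2}{\sigma_2^2}\right)\right\}dx\,dy, \tag{5}$$

both give the same total chance of $x$ or of $y$ separately being in a given
range. Consequently we can introduce two functions (called by Pearson
the grades)

$$X = \frac{1}{\sqrt{(2\pi)}\,\sigma_1}\int_{-\infty}^{x}\exp\left(-\frac{x^2}{2\sigma_1^2}\right)dx, \quad Y = \frac{1}{\sqrt{(2\pi)}\,\sigma_2}\int_{-\infty}^{y}\exp\left(-\frac{y^2}{2\sigma_2^2}\right)dy \tag{6}$$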


and eliminate x and y in favour of X and Y . Then the right of (4) is 
simply dXdY for X and Y between 0 and 1. Then (5) expressed in 
terms of X and Y gives a distribution within a square, and showing 
correlation between X and Y. Further, such a transformation would 
be possible for any law of chance; we simply need to take as new 
variables the chances that x and y separately are less than given values. 
The result will not be a normal correlation surface in either case, and 
there appears to be no reason to suppose that it would always be of the 


† Drapers' Co. Research Mems., Biometric Series, 4, 1907, 1-39.





same functional form. Nevertheless, one property of normal correlation 
will persist. The exponent in (5) can be written 


$$-\frac{1}{2(1-\rho^2)}\left(\frac{x}{\sigma_1}-\frac{\rho y}{\sigma_2}\right)^2 - \frac{y^2}{2\sigma_2^2}, \tag{7}$$

and we can take $x' = x-\rho\sigma_1 y/\sigma_2$ and $y$ as new variables. These will
have independent chances, and then if $\rho$ tends to 1 the standard error
of $x'$ will tend to 0 and that of $y$ to $\sigma_2$. Thus in the limiting case the
normal correlation surface reduces to a concentration along a line and
$y$ is strictly a monotonic function of $x$. Analogous relations hold if $\rho$
tends to $-1$. But then $X$ and $Y$ will be equal, since $x$ and $y$ are
proportional.

An analogous transformation applied to any other law will make X 
and Y equal if x and y are monotonic functions of each other, not 
necessarily linear, and r will be +1 or −1. Now it seems to me the
chief merit of the method of ranks that it eliminates departure from
linearity, and with it a large part of the uncertainty arising from the
fact that we do not know any form of the law connecting x and y. For 
any law we could define X and Y, and then a new x and y in terms of 
them by (6). The result, expressed in terms of these, need not be a 
normal correlation surface, but the chief difference will be the one that 
is removed by reference to orders instead of measures. 

Accordingly it appears that if an estimate of the correlation, based
entirely on orders, can be made for normal correlation, it may be 
expected to have validity for other laws; the same type of validity as 
the median of a series of observations has in estimating the median of 
the law, that is, not necessarily the best that can ever be done, but the 
best that can be done until we know more about the form of the law 
itself. But whereas for normal correlation it will estimate departure 
from linearity, for the more general law it will estimate how far one 
variable departs from being a monotonic function of the other. 

Pearson investigates the expectations of Spearman’s two coefficients 
for large samples of given size derived from a normal correlation surface, 
and gets

$$E(r) = \frac{6}{\pi}\sin^{-1}\tfrac12\rho,$$

so that

$$\rho = 2\sin(\tfrac16\pi r) \tag{8}$$

is an estimate of $\rho$ involving only orders. In terms of $R$ he gets

$$\rho = 2\cos\{\tfrac13\pi(1-R)\} - 1. \tag{9}$$

The latter has the larger uncertainty. Little further attention has
therefore been paid to $R$. The expectation of the square of the random
variation in $r$ leads to the result that, if $\rho$ is given, the standard error
of an estimate of $\rho$ from $r$ would be

$$1.0472\,\frac{1-\rho^2}{\sqrt{n}}\,(1+0.042\rho^2+0.008\rho^4+0.002\rho^6). \tag{10}$$

The corresponding formula for a correlation found directly from the
measures is $(1-\rho^2)/\sqrt{n}$, so that even for normal correlation $r$ gives a
very efficient estimate. Pearson comments on the fact that in some
cases where the distribution is far from normal the value of $\rho$ found
from $r$ is noticeably higher than that found from the usual formula, and
seems to think that the fault lies with $r$. But if $x$ was any monotonic
function of $y$ other than a linear one, the usual formula would give $\rho$
less than 1, whereas the derivation from $r$ would be 1. Thus if $y = x^3$
for $-1 < x < 1$, we have

$$E(x^2) = \tfrac13, \quad E(x^6) = \tfrac17, \quad E(x\cdot x^3) = \tfrac15, \quad \rho = \tfrac15\Big/\left(\tfrac{1}{21}\right)^{1/2} = 0.917.$$

The ranks method puts $x$ and $x^3$ in the same order and leads to $\rho = 1$;
but that is not a defect of the method, because it does not measure
departure from linearity but from monotonicity, and in its proper sense
it gives the right answer. The formula based on $\sum xy$ measures departure
from linearity, and there is no inconsistency. Further, there is no
reason to suppose that with great departures from normality this 
formula gives an estimate of anything particular. 
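
The conversions (8), (9), and (10), and the example of $y = x^3$, can be checked numerically. The Python below is a sketch under stated assumptions and is not from the original text: the function names are hypothetical, and $x$ is taken to be uniformly distributed on $(-1, 1)$, represented by an evenly spaced grid of midpoints.

```python
import math

def rho_from_r(r):
    """Equation (8): estimate of rho from the rank correlation r."""
    return 2.0 * math.sin(math.pi * r / 6.0)

def rho_from_R(R):
    """Equation (9): estimate of rho from Spearman's first coefficient R."""
    return 2.0 * math.cos(math.pi * (1.0 - R) / 3.0) - 1.0

def rank_se_factor(rho):
    """Factor in (10) by which the uncertainty from ranks exceeds that from measures."""
    return 1.0472 * (1 + 0.042 * rho ** 2 + 0.008 * rho ** 4 + 0.002 * rho ** 6)

def product_moment(x, y):
    """Ordinary correlation coefficient of the measures."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Assumed illustration: x uniform on (-1, 1), y = x**3.
N = 2000
xs = [(2 * i + 1) / N - 1 for i in range(N)]
ys = [v ** 3 for v in xs]
pm = product_moment(xs, ys)        # close to sqrt(21)/5
same_order = ys == sorted(ys)      # x and x**3 are placed in the same order
```

The product moment of the measures comes out near $(21/25)^{1/2}$, while the placings of `xs` and `ys` coincide, so that the rank correlation is 1 and `rho_from_r(1.0)` returns 1 to within rounding.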

Pearson is very critical of Spearman in parts of this paper, but I think 
that he provides a very satisfactory justification of his coefficient. 
Spearman has replied,† but does not mention the last point, which I
think is the chief merit of his method. The rank correlation leads to 
nearly as good an estimate as the product moment in the case where the 
latter is definitely the best estimate. It is also right in cases of complete 
association where y is a monotonic but not a linear function of x. In 
such cases the normal law and normal correlation do not hold, and the 
product moment would suggest imperfect association between x and y. 
It is also right in testing absence of association. For general use where 
the law is unknown and may be far from normal it seems in this respect 
to be definitely better than $\sum xy/nst$. Its defect is that we still have not
succeeded in stating just what it measures in general. The normal 
correlation surface is a complete statement of the joint chance of two 
† Brit. Journ. Psych. 3, 1910, 271-96. See also Yule, J. R. Stat. Soc. 70, 1907, 666.




variables, and p is a parameter in this law. The extension to non-normal 
correlation would still require such a law, containing one new parameter, 
leading to an expression for the joint chance of n individuals being 
arranged in any two orders with respect to two abilities, and stated 
entirely in terms of those orders. Such a law has not been found; I 
have searched for possible forms, but all have been either intrinsically 
unsatisfactory in some respect or led to mathematical difficulties that 
I, at any rate, have not succeeded in overcoming. Till this is done there 
will be some doubt as to just what we mean quantitatively, in regard 
to two quantities both subject to a certain amount of random variation, 
by the amount of departure from monotonicity. Should the law involve 
$\exp\{-\alpha|X-Y|\}$ or $\exp\{-\alpha(X-Y)^2\}$, for instance, we should be led to
different functions of the observed positions to express the best value of
$\alpha$; and to decide between them would apparently need extensive study of
observations similar to those used to test whether the normal law of errors
holds for measures. It cannot be decided a priori, and until we have some
way of finding it by experiment some indefiniteness is inevitable. 

Pearson's formula for the standard error of the correlation coefficient, 
as found for the normal correlation surface by the method of ranks, does 
not give the actual form of the probability distribution, which is far from 
normal unless the number of observations is very large. But his esti- 
mates of uncertainty for the correlation coefficient found by the most 
efficient method in this case, and for that found from the rank coefficient,
have been found by comparable methods, and two functions with the 
same maximum, the same termini at ± 1 , and the same second moment 
about the maximum, are unlikely to differ greatly. It appears therefore 
that we can adapt the formulae 3.8 (25) and (26) by simply multiplying 
the standard error of $\zeta$ by

$$1.0472(1+0.042\rho^2+0.008\rho^4+0.002\rho^6)$$

for the estimated $\rho$.

An alternative method is given by Fisher and Yates. One disadvan- 
tage of the correlation between ranks as they stand is that if we have, 
say, 10 pairs, in the same order, the effect of interchanging members 
1 and 2 in one set is the same as that of interchanging members 5 and 6.
That is, the correlations of the series 

1, 2, 3, 4, 5, 6, 7, 8, 9, 10 
with 2, 1, 3, 4, 5, 6, 7, 8, 9, 10 

and with 1, 2, 3, 4, 6, 5, 7, 8, 9, 10 

are the same. But if the series are the results of applying ranking to a 



normal correlation surface this is wrong, for the difference between 
members 1 and 2 would ordinarily be much larger than that between 
members 5 and 6. Fisher and Yates† deal with this by using the ranks,
as far as possible, to reconstruct the measures that would be obtained
in a normal correlation. If as before we use $X$ to denote the chance of
an observation less than $x$, where $x$ is derived from the normal law with
$\sigma = 1$, the chance of $p-1$ observations being less than $x$, $n-p$ greater
than $x+dx$, and one between $x$ and $x+dx$, is

$$\frac{n!}{(p-1)!\,(n-p)!}\,X^{p-1}(1-X)^{n-p}\,dX,$$

and this is the chance that the $p$th observation will lie in a range $dx$.
The expectation of $x$ for the $p$th observation in order of rank is therefore

$$x_p = \frac{n!}{(p-1)!\,(n-p)!}\int_0^1 x\,X^{p-1}(1-X)^{n-p}\,dX,$$

and if this is substituted for the rank we have a variable that can be
used to find the correlation coefficient directly without transformation.
This avoids the above difficulty. It makes the expectation of the sum
of the squares of the differences between the actual measures and the
corresponding $x_p$ a minimum. Fisher and Yates give a table of the
suitable values of $x_p$ for $n$ up to 30. The uncertainty given by this method
must be larger than that for normal correlation when the data are the
actual measures, and smaller than for the correlation derived from
Spearman’s coefficient, and the difference is not large. Fisher and
Yates tabulate $\sum x_p^2$. Fisher tells me privately that the allowance
would be got by multiplying the uncertainty by $(n/\sum x_p^2)^{1/2}$, but the
proof has not been published.
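
The expectation of $x$ for the $p$th observation can be evaluated by direct numerical integration over $X$. The following Python is a rough sketch, not the method by which the published tables were computed: the function name `expected_normal_scores` and the midpoint rule with `steps` subdivisions are assumptions made for illustration, and the accuracy is only a few decimal places.

```python
import math
from statistics import NormalDist

def expected_normal_scores(n, steps=20000):
    """Expected values of the n ordered observations from a unit normal law.

    Integrates x(X) * X^(p-1) * (1-X)^(n-p) over 0 < X < 1 by the midpoint
    rule, where x(X) is the normal deviate whose chance of not being
    exceeded is X.
    """
    inv = NormalDist().inv_cdf
    h = 1.0 / steps
    grid = [(k + 0.5) * h for k in range(steps)]
    deviates = [inv(X) for X in grid]
    scores = []
    for p in range(1, n + 1):
        coeff = p * math.comb(n, p)      # equals n! / ((p-1)! (n-p)!)
        total = sum(x * X ** (p - 1) * (1.0 - X) ** (n - p)
                    for x, X in zip(deviates, grid))
        scores.append(coeff * total * h)
    return scores
```

For `n = 3` the largest score should be close to $3/(2\sqrt{\pi}) \approx 0.846$, the middle one zero, and the smallest the negative of the largest. Substituting these scores for the placings and working out the ordinary correlation coefficient gives the kind of estimate described above.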

The difference between Pearson’s method and Fisher’s recalls a 
similar problem for one variable (cf. 3.61). Re-scaling may obscure an 
essential feature of the distribution, and presumably will also do so for 
distributions for two variables. I think that what is needed is rather 
some method of analysis, like the use of the median in 4.4, such that 
the results will be as insensitive as possible to the actual form of the law; 
completely insensitive they cannot be. 

A further way of estimating rank correlation is given by Kendall.‡

4.71. Grades and Contingency. The method of ranks can be ex- 
tended to a contingency table classified by rows and columns. Pearson’s 


† Statistical Tables, 1938, pp. 13, 60-1.

‡ The Advanced Theory of Statistics, ch. 16, especially pp. 391-4, 403-8.





analysis actually leads to (8) and (10) by a consideration of the 
correlation between grades, which are the quantities I have denoted by 
X and Y and are called and by him. If the quantities correlated 
are magnitudes and we have a series of measures, then for the normal 
correlation surface X and Y will be read from a table of the error func- 
tion and known for each observation with the same order of accuracy as 
the measures. Then the rank correlation will be the correlation between 
$X$ and $Y$. If we have the orders of individuals with regard to two
properties, these provide the estimated $X$ and $Y$, from which we can
calculate the rank correlation and proceed to $\rho$, in possibly an extended
sense if the correlation is not normal. When data have been classified 
the same will hold approximately, on account of the small effect of even 
rather drastic grouping on the estimates. The following table of the 
relation of colours and spectral types of stars provides an example.†
The spectral types are denoted by $x$, the colours by $y$, as follows.‡

    x   Spectral type       y   Colour
    1   Helium stars        1   White
    2   Hydrogen stars      2   White with faint tinge of colour
    3   α Carinae type      3   Very pale yellow
    4   Solar stars         4   Pale yellow
    5   Arcturus type       5   Full yellow
    6   Aldebaran type      6   Ruddy
    7   Betelgeuse type


                            y
    x          1     2     3     4     5     6   Total   Mean rank X ÷ 100
    1        125   146     8     3     0     0     282        -5.9
    2        168   195    14     0     0     0     377        -2.6
    3          3    97    23     8     6     0     137         0
    4          0    41    77    33    29     0     180        +1.6
    5          0    15    86    77    63     0     241        +2.8
    6          0     0     4    22    43     6      75        +4.4
    7          0     3     2    39    19     5      68        +5.1
    Total    296   497   214   182   160    11   1,360
    Mean rank Y ÷ 100
            -7.5  -3.6     0  +2.0  +3.7  +4.5










For convenience the zero of rank is taken in the middle of the third
group for both $X$ and $Y$, and the ranks given are the means of the
placings relative to this zero, and divided by 100. Then we find

$$\sum X = -1004, \quad \sum Y = -3003,$$

$$\sum X^2 = 17939, \quad \sum Y^2 = 26233, \quad \sum XY = +15914.$$

† W. S. Franks, M.N.R.A.S. 67, 1907, 639-42. Quoted by Brunt, Combination of
Observations, p. 170.

‡ What Franks calls a white star would be called bluish by many observers, who
would call his second class white.




The mean ranks are therefore $\bar{X} = -0.7$, $\bar{Y} = -2.2$; to reduce to
the means we must apply to $\sum X^2$, $\sum Y^2$, $\sum XY$ corrections
$-1004\times 0.7$, $-3003\times 2.2$, $-1004\times 2.2$. Also we must correct $\sum X^2$
and $\sum Y^2$ for grouping. In the first row for $x$, for instance, grouping
has made a contribution of $\tfrac{1}{12}\,282\,(2.82)^2$ to $\sum X^2$, and so on. It does not
affect the product systematically. Allowing for this we should reduce
$\sum X^2$ and $\sum Y^2$ by a further 826 and 1405. Thus the corrected values
are

$$\sum X^2 = 16410; \quad \sum Y^2 = 18192; \quad \sum XY = +13765.$$

These give $r = +0.798$.

To convert to an analogue of the correlation coefficient we must take

$$\rho = 2\sin(0.5236\times 0.798) = 0.812.$$

Applying the $z$ transformation we get

$$\zeta = 1.133 - 0.003 \pm 0.037.$$

This uncertainty is a little too low, since it has allowed for grouping,
which should not be done in estimating uncertainties. This has altered
both $\sum X^2$ and $\sum Y^2$ by about 5 per cent., and we should increase the
standard error of $\zeta$ by the same amount. Also we should multiply by
4.7 (10) because we are working with ranks and not measures. This is
1.09. Hence (ranges corresponding to the standard error)

$$\zeta = 1.130 \pm 0.042 = 1.088 \text{ to } 1.172,$$

$$\rho = +0.796 \text{ to } +0.825.$$
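
The last steps of this arithmetic can be retraced in Python. The sketch below is only a check on the conversion from $r$ to the range for $\rho$; it takes the quoted figures ($r = 0.798$, the bias term 0.003, the standard error 0.037, and the two inflation factors 1.05 and 1.09) as given inputs rather than deriving them, and the function name `rho_range` is hypothetical.

```python
import math

def rho_range(r, bias, se_zeta, inflation):
    """Convert a rank correlation r to rho and to a one-standard-error range.

    bias is subtracted from z; se_zeta is multiplied by every factor in
    inflation before the limits are transformed back by tanh.
    """
    rho = 2.0 * math.sin(math.pi / 6.0 * r)     # equation (8); pi/6 = 0.5236
    z = math.atanh(round(rho, 3))               # z transformation of the rounded rho
    zeta = round(z, 3) - bias
    se = se_zeta
    for factor in inflation:
        se *= factor
    return rho, zeta, se, math.tanh(zeta - se), math.tanh(zeta + se)

rho, zeta, se, low, high = rho_range(0.798, 0.003, 0.037, [1.05, 1.09])
```

With these inputs `rho` rounds to 0.812, `zeta` to 1.130, and `se` to 0.042, and the limits `low` and `high` agree with the range for $\rho$ stated above to three decimals.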

Brunt, from the above data, using Pearson’s coefficient of mean
square contingency, gets $\rho = +0.71$. The difference is presumably
due to the skewness of the distribution, the greatest concentration
being in one corner of the table. I think that my larger value gives a
better idea of the closeness of the correspondence. But I think that
the use of this coefficient to estimate association is undesirable for other
reasons. In a rectangular contingency table $\chi^2$ may be computed
against the hypothesis of proportionality of the chances in the rows,
and Pearson defines the mean square contingency by

$$\phi^2 = \chi^2/N,$$

where $N$ is the whole number of observations.† He then considers the
laws for correlations 0 and $\rho$, on the former of which proportionality
would hold, and works out, against the chances given by it, the value
of $\phi^2$ supposing the number of observations very large and distributed

† Drapers' Co. Res. Mems., Biometric Series, 1, 1904.




exactly in proportion to the expectations given by normal correlation $\rho$.
The result for this limiting case is $\rho^2/(1-\rho^2)$; and hence

$$r = \left(\frac{\phi^2}{1+\phi^2}\right)^{1/2}$$

is suggested as a possible means of estimating $\rho$. Unfortunately in
practice we are not dealing with limiting cases but with a finite number
of observations classified into groups, and even if the two variables were
strictly independent the sampling errors would in general make $\chi^2$ about
$(m-1)(n-1)$, where $m$ and $n$ are the numbers of rows and columns.
For an actual series of observations $\phi^2$ will always be positive, and $r$
will be estimated by this method as about $\{(m-1)(n-1)/N\}^{1/2}$ if the
variations are independent. This is not negligible. But also if there are
any departures from proportionality of the chances whatever, irrespective
of whether they are in accordance with a normal correlation, they
will contribute to $\chi^2$ and therefore to the estimate of $\rho^2$. The excess
chances might, for instance, be distributed alternately by rows and
columns so as to produce a chessboard pattern; this is nothing like
correlation, but the method would interpret it as such. Or there
might be a failure of independence of the events, leading to a tendency
for several together to come into the same compartment; an extension
of the idea that we have had in the negative binomial distribution.
This would not affect the distribution of the expectation, but it would
increase $\phi^2$. On the other hand, grouping will reduce $\phi^2$ if the correlation
is high. Accordingly I think that this function, or any other function
of $\chi^2$, should be used as an estimate only when the only parameter
considered is one expressing intraclass correlation or non-independence of
the events. It is not suited to estimate the normal correlation coefficient


because too many other complications can contribute to it and produce a 
bias. In the above case, however, the departure from normality itself has 
led to a greater effect in the opposite direction, and in the circumstances 
it seems that this way of estimating association would be best abandoned. 
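
The objection concerning a chessboard pattern can be made concrete. The Python below is a hedged sketch with a hypothetical function name and an invented 4 by 4 table; it computes $\chi^2$ against proportionality, the mean square contingency, and the resulting estimate, assuming every row and column total is greater than zero.

```python
import math

def mean_square_contingency(table):
    """Return (chi2, phi2, estimate) for a rectangular table of counts."""
    N = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / N   # proportionality of chances
            chi2 += (observed - expected) ** 2 / expected
    phi2 = chi2 / N
    return chi2, phi2, math.sqrt(phi2 / (1.0 + phi2))

# Invented chessboard table: excess counts alternate by rows and columns.
chessboard = [
    [15, 5, 15, 5],
    [5, 15, 5, 15],
    [15, 5, 15, 5],
    [5, 15, 5, 15],
]
chi2, phi2, estimate = mean_square_contingency(chessboard)
```

Here every expected count is 10, $\chi^2 = 40$, and the estimate is about 0.45, although the arrangement shows no tendency for high placings in one classification to accompany high placings in the other.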


4.8. The estimation of an unknown and unrestricted integer. The 

following problem was suggested to me several years ago by Professor 
M. H. A. Newman. A man travelling in a foreign country has to change 
trains at a junction, and goes into the town, of the existence of which 
he has only just heard. He has no idea of its size. The first thing that
he sees is a tramcar numbered 100. What can he infer about the number
of tramcars in the town? It may be assumed for the purpose that they
are numbered consecutively from 1 upwards. 



214 APPROXIMATE METHODS AND SIMPLIFICATIONS Chap. IV 


The novelty of the problem is that the quantity to be estimated is 
a positive integer, with no apparent upper limit to its possible values. 
A uniform prior probability is therefore out of the question. For a
continuous quantity with no upper limit the $dv/v$ rule is the only
satisfactory one, and it appears that, apart from possible complications
at the lower limit, we may suppose here that if n is the unknown number

$$P(n \mid H) \propto n^{-1} + O(n^{-2}). \tag{1}$$

Then the probability, given n, that the first specimen will be number m
in the series is

$$P(m \mid n, H) = 1/n \quad (m \le n), \tag{2}$$

and therefore

$$P(n \mid m, H) \propto n^{-2} + O(n^{-3}) \quad (n \ge m). \tag{3}$$

If m is fairly large the probability that n exceeds some definite value
$n_0$ will be

$$P(n > n_0 \mid m, H) = \sum_{n_0+1}^{\infty} n^{-2} \Big/ \sum_{m}^{\infty} n^{-2} = \frac{m}{n_0}, \tag{4}$$

nearly. With one observation there is a probability of about 1/2 that n is
not more than 2m.
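
The approximation in (4) can be checked numerically. The sketch below is only an illustration: it keeps the leading term of the posterior, taking it proportional to $n^{-2}$ for $n \ge m$ and ignoring the unspecified higher-order terms, and the helper names `tail_sum` and `prob_exceeds` are hypothetical, not from the text.

```python
def tail_sum(start, terms=100_000):
    """Approximate the infinite sum of n**-2 over n >= start.

    The first `terms` terms are added directly; the remainder from `stop`
    onwards is approximated by 1 / (stop - 0.5).
    """
    stop = start + terms
    head = sum(1.0 / (n * n) for n in range(start, stop))
    return head + 1.0 / (stop - 0.5)


def prob_exceeds(n0, m):
    """P(n > n0 | first tramcar seen is numbered m), posterior taken as n**-2."""
    if n0 < m:
        return 1.0  # n cannot be smaller than the number observed
    return tail_sum(n0 + 1) / tail_sum(m)


if __name__ == "__main__":
    m = 100  # the tramcar number in the example
    for n0 in (150, 200, 500, 1000):
        print(n0, round(prob_exceeds(n0, m), 4), "approx m/n0 =", m / n0)
```

For m = 100 the computed probability that n exceeds 2m comes out close to the value 1/2 given by the rule m/n_0.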

I have been asked this question several times and think that an 
approximate solution may be worth recording. The interesting thing 
is that the questioners usually express a feeling that there is something 
special about the value 2m, without being able to say precisely what it 
is. The adopted prior probability makes it possible to say how a single
observation can lead to intelligible information about n, and it seems to
be agreed that it would do so. I see no way, however, of fixing the terms
of order $n^{-2}$.

The extension to the case where several members of the series are 
observed is simple, and is closely analogous to the problem of finding 
a rectangular distribution from a set of measures. 


4.9. Artificial randomization. This technique in experimental design 
has been greatly developed by Fisher,† and more recently by Yates,‡
chiefly in relation to agricultural experiments. The primary problem 
in the work is to compare the productivities of different varieties of a 
plant and the effects of different fertilizers and combinations of ferti- 
lizers. The difficulty is that even if the same variety is planted in a 
number of plots and all receive the same treatment, the yields differ. 
Such tests are called uniformity trials. This would not affect the work 

† The Design of Experiments, 1936.

‡ J. R. Stat. Soc. Suppl. 2, 1936, 181-223; The Design and Analysis of Factorial
Experiments, Imp. Bur. of Soil Science, 1937.




if the yields were random; if they were, the plot yields could be taken 
as equations of condition for the varietal and treatment differences and 
the solution completed by least squares, thus obtaining the best possible 
estimates and a valid uncertainty. Unfortunately they are not random. 
In uniformity trials it is habitually found that there is a significant 
gradient in the yield in one or other direction on the ground. Even when 
this is estimated and taken into account it is found that there is a marked 
positive correlation between neighbouring plots. Further, many fields 
have at some stage of their history been laid out for drainage into a 
series of parallel ridges and furrows, which may leave a record of them- 
selves in a harmonic variation of fertility. The result is that the analysis 
of the variation of the plot yields into varietal and treatment differences 
and random error does not represent the known facts; these ground 
effects must be taken into account in some way. The best way, if we want 
to get the maximum accuracy, would be to introduce them explicitly as 
unknowns, form normal equations for them also, and solve. Since the 
arrangement of the plots is at the experimenter’s disposal, his best plan 
is to make it so that the equations for the various unknowns will be 
orthogonal. One of the best ways of doing this is by means of the Latin 
square. If the plots are arranged in a 5 x 5 square to test five varieties, 
and each variety occurs just once in each row and each column, the 
estimates of the differences between the varieties will be the differences 
of the means of the plots containing them, irrespective of the row and 
column differences of fertility. But unfortunately the correlation be- 
tween neighbouring plots still prevents the outstanding variation from 
being completely random. If it was, all Latin squares would be equally 
useful. But suppose that we take Cartesian coordinates of position at 
the centre of each square, the axes being parallel to the sides. Then if 
variations of fertility are completely expressed by the row and column 
totals they are expressible in the form 

$$F = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 + b_1 y + b_2 y^2 + b_3 y^3 + b_4 y^4.$$

For with suitable choices of the a's and b's it will be possible to fit all
the row and column totals exactly. But this contains no product terms,
such as xy. In certain conditions this might be serious; for if $x^2$ and $y^2$
produce a significant variation it would only be for one special orientation
of the sides that the xy term would be absent, and if the plots
containing one variety all correspond to positive xy and all containing
another to negative xy, part of the difference between the means for
these sets of plots will be due to the xy term in the fertility and not to 





the differences of the varieties. This will happen with the most obvious 
design, namely 


A  B  C  D  E
E  A  B  C  D
D  E  A  B  C
C  D  E  A  B
B  C  D  E  A


Here varieties C and D have positive or zero xy every time, while A
has negative or zero xy every time. If, then, the $x^2$ and $y^2$ terms should
be eliminated, should we not estimate and eliminate xy too? On the
face of it it will usually be more important than higher terms such as
$x^4$, but the real question is, where are we to stop? If we should keep
the whole of the terms up to the fourth power, we shall need to eliminate
6 extra terms, leaving only 6 to give an estimate of the random
variation; if we should go to $x^4y^4$ we should be left with no information
at all to separate varieties from fertility. We must stop somewhere,
and for practical reasons Fisher introduces at this stage another method
of dealing with xy, which leaves it possible to use the plot means
alone to estimate the varietal differences and at the same time to treat
the outstanding variation as if it were random, though in fact it is not.
Possibly it is often an unnecessary refinement to eliminate the higher 
terms completely, as he does, but the analysis doing so is easier than 
it would be to omit them and find the lower ones by least squares, and 
it does no harm provided sufficient information is left to provide a 
good estimate of the uncertainty. But there might be a serious danger 
from xy. In a single 5x5 square each variety occurs only 5 times, and 
some of this information, effectively 1.8 plots per variety, is sacrificed
in eliminating the row and column fertility effects. But if we use the 
usual rules for estimating uncertainty they will suppose that when we 
have allowed for rows, columns, and varieties, the rest of the variation 
is random. If there is an xy term, this will be untrue, since the sign of 
this term in one plot will determine that in every other. With some 
arrangements of the varieties the contributions to the means of the 
plots with the same variety due to xy will be more, with others less, 
than would be expected if they were completely random contributions 
with the same mean square. Consequently it will not be valid to treat 
the outstanding variation as random in estimating the uncertainty of 
the differences between the varieties. The xy term could be introduced explicitly,






with an unknown coefficient to be found from the data, and then on
eliminating it the results would be unaffected by it. But this would 
mean appreciable increase of labour of computation, and the possibility 
of still higher terms might then have to be considered. 
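
The behaviour of the cyclic design can be verified directly. The following is a minimal sketch, assuming unit spacing between plot centres, x increasing from left to right and y increasing from the bottom row to the top row (an orientation chosen here so that the signs agree with the text). The fertility coefficients and the helper name `variety_means` are hypothetical.

```python
CYCLIC = ["ABCDE", "EABCD", "DEABC", "CDEAB", "BCDEA"]


def variety_means(square, fertility):
    """Mean of fertility(x, y) over the plots of each variety.

    Coordinates are taken at plot centres, with the origin at the
    centre of the square.
    """
    size = len(square)
    half = (size - 1) / 2
    totals = {}
    for row, letters in enumerate(square):
        y = half - row  # top row has the largest y (assumed orientation)
        for col, variety in enumerate(letters):
            x = col - half
            totals[variety] = totals.get(variety, 0.0) + fertility(x, y)
    return {variety: total / size for variety, total in totals.items()}


# Fertility depending on rows and columns separately (illustrative coefficients):
additive = variety_means(CYCLIC, lambda x, y: 3 * x + x**2 - 2 * y + 0.5 * y**4)

# Fertility containing only a product term:
product = variety_means(CYCLIC, lambda x, y: x * y)

if __name__ == "__main__":
    print("additive:", additive)  # identical for every variety
    print("product: ", product)   # differs between varieties
```

With the additive fertility every variety receives the same mean, so the differences of the plot means are unaffected; with the xy term alone the means of A, C, and D are displaced in opposite directions.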

Again, it is usual to lay out two or three squares to reduce the
uncertainty. If the same design was used for three squares there would be
a 1/4 chance that every variety would have $\sum xy$ for its plots with the
same sign in every square. This is not a negligible chance; and though
the differences of the $\sum xy$ for the varieties in one square might be
unimportant, their contribution to the estimated total differences would
be multiplied by 3 in three squares, while their contribution to the
estimated standard error of these totals, assuming randomness, would
only be multiplied by $\sqrt{3}$. Thus if the design is simply copied, and an
xy term is present, there is an appreciable chance that it may lead to 
differences that would be wrongly interpreted as varietal. 

Fisher proceeds, instead of determining the xy term, to make it into 
a random error. This is done by arranging the rows and columns of 
every square at random. Thus if we start with the arrangement given 
above, we have in the first column the order AEDCB. By a process 
such as card shuffling we rearrange these letters in a new order, such 
as CADEB. The rows are then rearranged, keeping each row intact, 
so as to bring the letters in the first column into this order. The letters 
in the first row are now in the order CDEAB. Shuffling these we get 
ECBAD; and now rearranging the columns we get the final arrangement

E  C  B  A  D
C  A  E  D  B
A  D  C  B  E
B  E  D  C  A
D  B  A  E  C



The varieties would be laid out in this order in an actual square; but 
for the second and third squares entirely separate rearrangements must 
be made. There is no such thing as an intrinsically random arrange- 
ment. The whole point of the design is that if there is an xy term in 
the fertility, its contribution to any varietal total in one square shall 
give no information relevant to the total in another square. Card 
shuffling is fairly satisfactory for this purpose because one deal does give 
little or no information relevant to the next. But if the deal is simply 





copied the terms in xy for one square will give information about their 
values in the others, and the shuffling fails in its object. An arrange- 
ment can only be random once. 
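
The two-stage rearrangement described above can be sketched as follows. This is an illustration only: a pseudo-random generator stands in for card shuffling (an assumption, not the procedure of the text), and the function names are hypothetical. The worked orders CADEB and ECBAD are those quoted in the text.

```python
import random

CYCLIC = ["ABCDE", "EABCD", "DEABC", "CDEAB", "BCDEA"]


def rearrange_rows(square, first_column):
    """Reorder whole rows so that the first column reads `first_column`."""
    by_first_letter = {row[0]: row for row in square}
    return [by_first_letter[letter] for letter in first_column]


def rearrange_columns(square, first_row):
    """Reorder whole columns so that the first row reads `first_row`."""
    position = {letter: i for i, letter in enumerate(square[0])}
    order = [position[letter] for letter in first_row]
    return ["".join(row[i] for i in order) for row in square]


def is_latin(square):
    """True if every letter occurs exactly once in each row and column."""
    letters = set(square[0])
    rows_ok = all(set(row) == letters for row in square)
    cols_ok = all(set(col) == letters for col in zip(*square))
    return rows_ok and cols_ok


def randomize(square, rng):
    """One fresh rearrangement: shuffle the row order, then the column order."""
    first_column = [row[0] for row in square]
    rng.shuffle(first_column)
    square = rearrange_rows(square, first_column)
    first_row = list(square[0])
    rng.shuffle(first_row)
    return rearrange_columns(square, first_row)


# The worked example of the text: rows to CADEB, then columns to ECBAD.
worked = rearrange_columns(rearrange_rows(CYCLIC, "CADEB"), "ECBAD")

if __name__ == "__main__":
    print("\n".join(worked))
    rng = random.Random()  # a separate draw is needed for every square laid out
    print("\n".join(randomize(CYCLIC, rng)))
```

Because only whole rows and whole columns are moved, the result is still a Latin square, so the row and column effects remain exactly eliminable while the xy contribution changes from one rearrangement to the next.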

This procedure, highly successful in practice, shows well the condi- 
tions for the use of artificial randomization. In the first place, the 
square is not randomized completely. The rule that each variety shall 
occur just once in every row and in every column is absolute. If 25 
cards were lettered, 5 with A, 5 with B, and so on, and shuffled, the 
result would be that some letters would be completely absent from some 
columns and appear two or three times in others. The result would be 
a loss of accuracy in the estimation of the linear gradients, which could 
therefore not be allowed for with so much accuracy, and this would 
increase the final uncertainty of the varietal differences. Here is the
first principle: we must not try to randomize a systematic effect that
is known to be considerable in relation with what we are trying to find.
The design must be such that such effects can be estimated and elimi- 
nated as accurately as possible, and this is done best if we make an error 
in an unknown of either set contribute equally to the estimates of all 
unknowns of the other sets. But this condition imposes a high degree 
of system on the design, and any attempt at randomness must be within 
the limits imposed by this system. In some discussions there seems to 
be a confusion between the design itself and the method of analysing 
the results. The latter is always to take the means of the plot yields 
with the same variety to give the estimates of the varietal differences. 
It is not asserted that this is the best method. If the xy term was allowed 
for explicitly the analysis would, in general, be more complicated, but 
elimination of the variation due to it would leave results of a higher 
accuracy, which would not, however, rest simply on the differences of 
the means. The method of analysis deliberately sacrifices some accuracy 
in estimation for the sake of convenience in analysis. The question is 
whether this loss is enough to matter, and we are considering again the 
efficiency of an estimate. But this must be considered in relation to the 
purpose of the experiment in the first place. There will in general be 
varietal differences; we have to decide whether they are large enough 
to interest a farmer, who would not go to the expense of changing his 
methods unless there was a fairly substantial gain in prospect. There 
is, therefore, a minimum difference that is worth asserting. It is, how- 
ever, also important that differences asserted should have the right sign, 
and therefore the uncertainty stated by the method must be substan- 
tially less than the minimum difference that would interest the farmer. 




So long as this condition is satisfied it is not important whether the 
probability that the difference has the wrong sign is 0.01 or 0.001. The
design and the method of analysis are therefore, for this purpose, com- 
bined legitimately, provided that together they yield an uncertainty 
small enough for interesting effects not to be hidden by ground effects 
irrelevant to other fields and deliberately ignored. Previous experi- 
ments have usually indicated the order of magnitude of the uncertainty 
to be expected, with a given design, and it is mainly this that determines 
the size and number of the plots. This information, of course, is vague, 
and Fisher and Yates are right in treating it as previous ignorance 
when they have the data for the actual experiment, which are directly 
relevant. But it has served to suggest what effects are worth eliminating
accurately and what can be randomized without the subsequent method
of analysis, treating them as random, giving an uncertainty too large
for the main objects of the experiment to be fulfilled. In different
conditions, however, the effects that should be eliminated and those that may
be randomized and henceforth treated as random will not necessarily
be the same.†

The same principles arise in a more elementary way in the treatment 
of rounding-off errors in computation. If an answer is wanted to one 
decimal, the second decimal is rounded off so that, for instance, 1.87
is entered as 1.9 and 1.52 as 1.5. If the rejected figure is a 5 it is
rounded to the nearest even number; thus 1.55 is entered as 1.6 and
1.45 as 1.4. Thus these minor errors are made random by their association
with observational error and by the fact that there is no reason
to expect them to be correlated with the systematic effects sought. If 
rounding-off was always upwards or downwards it would produce a 
cumulative error in the means. 
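
The rounding rule can be tried out with Python's standard `decimal` module. This is a sketch: `ROUND_HALF_UP` is used here as one reading of "always upwards" applied to a rejected 5, and the range of test values (0.00 to 9.99) is an arbitrary choice made for illustration.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def round_one_decimal(value, mode=ROUND_HALF_EVEN):
    """Round a decimal value (given as text or Decimal) to one decimal place."""
    return Decimal(value).quantize(Decimal("0.1"), rounding=mode)


def mean_error(mode):
    """Mean of (rounded - exact) over all two-decimal values 0.00 to 9.99."""
    values = [Decimal(i) / 100 for i in range(1000)]
    return sum(round_one_decimal(v, mode) - v for v in values) / len(values)


if __name__ == "__main__":
    for text in ("1.87", "1.52", "1.55", "1.45"):
        print(text, "->", round_one_decimal(text))
    print("mean error, 5 to nearest even:", mean_error(ROUND_HALF_EVEN))
    print("mean error, 5 always upwards: ", mean_error(ROUND_HALF_UP))
```

Rounding a rejected 5 to the nearest even figure leaves no net error over this range, whereas always rounding it upwards leaves a small systematic excess in the mean.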

Most physicists, of course, will envy workers in subjects where un- 
interesting systematic effects can be randomized, and workers dealing 
with phenomena as they occur in nature will envy those who can design 
their experiments so that the normal equations will be orthogonal. 

† See also 'Student', Biometrika, 29, 1938, 363-79; E. S. Pearson and J. Neyman,
ibid. 29, 1938, 380-8; E. S. Pearson, ibid. 30, 1938, 159-79; F. Yates, ibid. 30, 1939,
440-66; Jeffreys, ibid. 31, 1939, 1-8.



V 

SIGNIFICANCE TESTS: ONE NEW PARAMETER 


‘Which way ought I to go to get from here ? ’ 

‘That depends a good deal on where you want to get to,’ said the Cat.

‘I don’t much care where ’ said Alice. 

‘Then it doesn’t matter which way you go,’ said the Cat. 

LEWIS CARROLL, Alice in Wonderland.


5.0. General discussion. The general principles of significance tests 
have been stated at the beginning of Chapter III. We need only recall 
that our problem is to compare a specially suggested value of a new 
parameter, often 0, with the aggregate of other possible values. We 
do this by enunciating the hypotheses q, that the parameter has the 
suggested value, and q', that it has some other value to be determined 
from the observations. We shall call q the null hypothesis, following
Fisher, and q' the alternative hypothesis. To say that we have no
information initially as to whether the new parameter is needed or not
we must take

$$P(q \mid H) = P(q' \mid H) = \tfrac12. \tag{1}$$

But q' involves an adjustable parameter, $\alpha$ say, and

$$P(q' \mid H) = \sum P(q', \alpha \mid H) \tag{2}$$

over all possible values of $\alpha$. We take $\alpha$ to be zero on q. Let the prior
probability of $d\alpha$, given q'H, be $f(\alpha)\,d\alpha$, where

$$\int f(\alpha)\,d\alpha = 1, \tag{3}$$

integration being over the whole range of possible values when the
limits are not given explicitly. Then

$$P(q'\,d\alpha \mid H) = \tfrac12 f(\alpha)\,d\alpha. \tag{4}$$

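We can now see in general terms that this analysis leads to a significance
test for $\alpha$. For if the maximum likelihood solution for $\alpha$ is $a \pm s$, the
chance of finding a in a particular range, given q, is nearly

$$P(da \mid qH) = \frac{1}{\sqrt{2\pi}\,s} \exp\left(-\frac{a^2}{2s^2}\right) da, \tag{5}$$

and the chance, given q' and a particular value of $\alpha$, is

$$P(da \mid q'\alpha H) = \frac{1}{\sqrt{2\pi}\,s} \exp\left\{-\frac{(a-\alpha)^2}{2s^2}\right\} da. \tag{6}$$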




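Hence by the principle of inverse probability

$$P(q \mid aH) \propto \frac{1}{\sqrt{2\pi}\,s} \exp\left(-\frac{a^2}{2s^2}\right), \tag{7}$$

$$P(q'\,d\alpha \mid aH) \propto \frac{f(\alpha)}{\sqrt{2\pi}\,s} \exp\left\{-\frac{(a-\alpha)^2}{2s^2}\right\} d\alpha. \tag{8}$$

It is to be understood that in pairs of equations of this type the sign
of proportionality indicates the same constant factor, which can be
adjusted to make the total probability 1.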

Consider two extreme cases. There will be a finite interval of $\alpha$ such
that $\int f(\alpha)\,d\alpha$ through it is arbitrarily near unity. If a lies in this range
and s is so large that the exponent in (8) is small over most of this
range, we have on integration, approximately,

$$P(q' \mid aH) = P(q \mid aH) \propto \frac{1}{\sqrt{2\pi}\,s}. \tag{9}$$

In other words, if the standard error of the maximum likelihood estimate
is greater than the range of $\alpha$ permitted by q', the observations
do nothing to decide between q and q'.

If, however, s is small, so that the exponent can take large values,
and $f(\alpha)$ is continuous, the integral of (8) will be nearly $f(a)$, and

$$\frac{P(q \mid aH)}{P(q' \mid aH)} = \frac{1}{\sqrt{2\pi}\,s f(a)} \exp\left(-\frac{a^2}{2s^2}\right). \tag{10}$$

We shall in general write

$$K = \frac{P(q \mid \theta H)}{P(q' \mid \theta H)}. \tag{11}$$


If the number of observations, n, is large, s is usually small like $n^{-1/2}$.
Then if a = 0 and n large, K will be large of order $n^{1/2}$, since $f(a)$ is
independent of n. Then the observations support q, that is, they say
that the new parameter is probably not needed. But if |a| is much larger
than s the exponential will be small, and the observations will support
the need for the new parameter. For given n, there will be a critical
value of a/s such that K = 1 and no decision is reached.

The larger the number of observations the stronger the support for q
will be if |a| < s. This is a satisfactory feature; the more thorough the
investigation has been, the more ready we shall be to suppose that if
we have failed to find evidence for $\alpha$ it is because $\alpha$ is really 0. But it
carries with it the consequence that the critical value of a/s increases
with n (though that of a of course diminishes); the increase is very slow,




since it depends on $\sqrt{\log n}$, but it is appreciable. The test does not
draw the line at a fixed value of a/s.
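
The slow growth of the critical ratio can be illustrated numerically. The sketch below uses the approximate form of 5.0 (10) under assumptions that are not in the text: the prior density is taken as constant, equal to 1/RANGE over an interval of length RANGE, and the standard error is taken as s = SIGMA/sqrt(n). The numerical values of SIGMA and RANGE are hypothetical.

```python
import math


def bayes_factor(a, s, f_at_a):
    """Approximate K of 5.0 (10): exp(-a^2 / 2s^2) / (sqrt(2 pi) s f(a))."""
    return math.exp(-a * a / (2.0 * s * s)) / (math.sqrt(2.0 * math.pi) * s * f_at_a)


def critical_ratio(s, f_at_a):
    """Value of |a|/s at which K = 1, valid only when K at a = 0 exceeds 1."""
    k_at_zero = bayes_factor(0.0, s, f_at_a)
    if k_at_zero <= 1.0:
        raise ValueError("s is too large for the approximation to give K > 1 at a = 0")
    return math.sqrt(2.0 * math.log(k_at_zero))


SIGMA = 1.0        # assumed standard error of one observation
RANGE = 10.0       # assumed length of the range of alpha permitted by q'
F = 1.0 / RANGE    # assumed constant prior density over that range

ratios = {}
for n in (10, 100, 1000, 10000):
    s = SIGMA / math.sqrt(n)
    ratios[n] = critical_ratio(s, F)

if __name__ == "__main__":
    for n, ratio in ratios.items():
        s = SIGMA / math.sqrt(n)
        print(n, "K(0) =", round(bayes_factor(0.0, s, F), 1),
              "critical a/s =", round(ratio, 2))
```

Under these assumptions K at a = 0 grows like the square root of n, while the critical value of a/s rises only from a little over 2 to about 3.5 as n goes from 10 to 10000.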

The simplicity postulate therefore leads to significance tests. The
difficulty pointed out before (p. 103) about the uniform assessment of
the prior probability was that even if $\alpha$ was 0, a would usually be
different from 0, on account of random error, and to adopt a as the
estimate would be to reject the hypothesis $\alpha = 0$ even if it was true.
We now see how to escape from this dilemma. Small values of |a| up
to some multiple of s will be taken to support the hypothesis $\alpha = 0$,
since they would be quite likely to arise on that hypothesis, but larger
values support the need to introduce $\alpha$. In suitable cases high
probabilities may be obtained for either hypothesis. The possibility of getting
actual support for the null hypothesis from the observations really
comes from the fact that the value of $\alpha$ indicated by it is unique.
q' indicates only a range of possible values, and if we select the one that
happens to fit the observations best we must allow for the fact that it
is a selected value. If |a| is less than s, this is what we should expect on
the hypothesis that $\alpha$ is 0, but if $\alpha$ was equally likely to be anywhere in
a range of length m it requires that an event with a probability 2s/m
shall have come off. If |a| is much larger than s, however, a would be
a very unlikely value to occur if $\alpha$ was 0, but no more unlikely than any
other if $\alpha$ was not 0. In each case we adopt the less remarkable
coincidence.

This approximate argument shows the general nature of the signifi- 
cance tests based on the simplicity postulate. The essential feature is 
that we express ignorance of whether the new parameter is needed by 
taking half the prior probability for it as concentrated in the value 
indicated by the null hypothesis, and distributing the other half over 
the range possible. 

The above argument contemplates a law q containing no adjustable 
parameter and a law q' containing precisely one. In practice we usually 
meet one or more of the following complications. 

1. q may itself contain adjustable parameters; q' contains one more
but reduces to q if and only if the extra parameter is zero. We shall
refer to the adjustable parameters present on q as old parameters, those
present on q' but not on q as new parameters.

2. q' may contain more than one new parameter. 

3. Two sets of observations may be considered. They are supposed 
derived from laws of the same form, but it is possible that one or more 
parameters in the laws have different values. Then q is the hypothesis 




that the parameters have the same value in the two sets, q' that at 
least one of them has different values. 

4. It may be already established that some parameters have different 
values on the two laws, but the question may be whether some further 
parameter differs. For instance, the two sets of data may both be 
derived from normal laws, and the standard errors may already be 
known to differ; but the question of the agreement of the true values 
remains open. This state of affairs is particularly important when a 
physical constant has been estimated by totally different methods and 
we want to know whether the results are consistent. 

5. More than two sets of observations may have to be compared. 
Several sets may agree, but one or more may be found to differ from 
the consistent sets by amounts that would be taken as significant if 
they stood by themselves. But in such cases we are picking out the 
largest discrepancy, and a discrepancy of any amount might arise by 
accident if we had enough sets of data. Some allowance for selection 
is therefore necessary in such cases. 

5.01. Treatment of old parameters. Suppose that there is one
old parameter $\alpha$; the new parameter is $\beta$, and is 0 on q. In q' we could
replace $\alpha$ by $\alpha'$, any function of $\alpha$ and $\beta$; but to make it explicit that
q' reduces to q when $\beta = 0$ we shall require that $\alpha' = \alpha$ when $\beta = 0$.
Suppose that $\alpha'$ is chosen so that $\alpha'$ and $\beta$ are orthogonal parameters
in the sense of 4.31; take

$$P(q\,d\alpha \mid H) = h(\alpha)\,d\alpha, \qquad P(q'\,d\alpha'\,d\beta \mid H) = h(\alpha')\,d\alpha'\,f(\beta, \alpha')\,d\beta, \tag{1}$$

where

$$\int f(\beta, \alpha')\,d\beta = 1. \tag{2}$$

For small changes of $\alpha'$ and $\beta$,

$$J = g_{\alpha\alpha}\,d\alpha'^2 + g_{\beta\beta}\,d\beta^2. \tag{3}$$

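If n is large, we get maximum likelihood estimates a and b for $\alpha'$ and $\beta$,
and

$$P(da\,db \mid q\alpha H) \propto \frac{n\sqrt{g_{\alpha\alpha} g_{\beta\beta}}}{2\pi} \exp[-\tfrac12 n\{g_{\alpha\alpha}(\alpha-a)^2 + g_{\beta\beta} b^2\}], \tag{4}$$

$$P(da\,db \mid q'\alpha'\beta H) \propto \frac{n\sqrt{g_{\alpha\alpha} g_{\beta\beta}}}{2\pi} \exp[-\tfrac12 n\{g_{\alpha\alpha}(\alpha'-a)^2 + g_{\beta\beta}(\beta-b)^2\}]. \tag{5}$$

Hence

$$P(q \mid abH) \propto \int h(\alpha) \exp[-\tfrac12 n\{g_{\alpha\alpha}(\alpha-a)^2 + g_{\beta\beta} b^2\}]\,d\alpha
\propto h(a) \sqrt{\frac{2\pi}{n g_{\alpha\alpha}}} \exp(-\tfrac12 n g_{\beta\beta} b^2), \tag{6}$$

$$P(q' \mid abH) \propto \iint h(\alpha') f(\beta, \alpha') \exp[-\tfrac12 n\{g_{\alpha\alpha}(\alpha'-a)^2 + g_{\beta\beta}(\beta-b)^2\}]\,d\alpha'\,d\beta
\propto h(a) f(b, a) \frac{2\pi}{n\sqrt{g_{\alpha\alpha} g_{\beta\beta}}}. \tag{7}$$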


This is of the same form as 5.0 (10). To the accuracy of this approximation
$h(\alpha)$ is irrelevant. It makes little difference to K whether we have
much or little previous information about the old parameter. $f(\beta, \alpha')$
is a prior probability density for $\beta$ given $\alpha'$.
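
The form referred to here can be recovered by dividing the approximate value of (6) by that of (7). The following is a sketch of that step, using only the symbols already defined and assuming, as in the text, that h varies slowly near a:

```latex
K = \frac{P(q \mid abH)}{P(q' \mid abH)}
  \approx \frac{h(a)\,\sqrt{2\pi/(n g_{\alpha\alpha})}\;\exp(-\tfrac12 n g_{\beta\beta} b^2)}
               {h(a)\,f(b,a)\;2\pi/\bigl(n\sqrt{g_{\alpha\alpha} g_{\beta\beta}}\bigr)}
  = \frac{1}{f(b,a)}\,\sqrt{\frac{n g_{\beta\beta}}{2\pi}}\;
    \exp(-\tfrac12 n g_{\beta\beta} b^2).
```

Writing $s = (n g_{\beta\beta})^{-1/2}$ for the standard error of b, this becomes $\{\sqrt{2\pi}\,s f(b,a)\}^{-1}\exp(-b^2/2s^2)$, in which $h(a)$ and $g_{\alpha\alpha}$ have cancelled.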

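If $\alpha''$ also reduces to $\alpha$ when $\beta = 0$, but is not orthogonal to $\beta$ for
small values of $\beta$, we may take

$$\alpha'' = \alpha' + \lambda\beta. \tag{9}$$

If instead of (1) we take

$$P(q'\,d\alpha''\,d\beta \mid H) = h(\alpha'')\,f(\beta, \alpha'')\,d\alpha''\,d\beta \tag{10}$$

we are led to

$$P(q' \mid abH) \propto \iint h(\alpha'+\lambda\beta)\,f(\beta, \alpha'+\lambda\beta) \exp[-\tfrac12 n\{g_{\alpha\alpha}(\alpha'-a)^2 + g_{\beta\beta}(\beta-b)^2\}]\,d\alpha'\,d\beta
= h(a+\lambda b)\,f(b, a+\lambda b)\,\frac{2\pi}{n\sqrt{g_{\alpha\alpha} g_{\beta\beta}}} \tag{11}$$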

provided now that h varies slowly. There will be little change in K if
b is small and we have little previous information about $\alpha''$; so that the
condition that old parameters shall be taken orthogonal to the new
ones makes little difference to the results. But if there is much previous
information about $\alpha''$ we may have to take account of the variation of
$h(\alpha'')$ in the range where the exponential is not small, and the disturbance
of the result may be considerable.

There is therefore no difficulty in principle in allowing for old para- 
meters. If previous considerations, such as a definite hypothesis or 
even a consistent model, suggest a particular way of specifying them 
on q', we may use it. If not, we can take them orthogonal to the new
one, because this automatically satisfies the condition that the parameter
$\alpha'$ that replaces $\alpha$ on q' shall reduce to $\alpha$ when $\beta = 0$; then the
prior probability of $\alpha$ on q can be immediately adapted to give a suitable
one for $\alpha'$ on q'. In these cases the result will be nearly independent
of previous information about the old parameters. 

In the first edition of this book I made it a rule that old parameters 
on q' should be defined in such a way that they would have maximum 
likelihood estimates independent of the new parameter. This was rather 
unsatisfactory because in estimation problems maximum likelihood 




arises as a derivative principle, as an approximation to the principle of 
inverse probability. It seemed anomalous that it should appear, appa- 
rently as a postulate, in the principles of significance tests. We now see 
that it is unnecessary, but that the notion of orthogonality leads to a 
specially convenient statement of the method; and orthogonal para- 
meters satisfy the rule of the first edition to the accuracy required. 

5.02. Required properties of $f(\alpha)$. To arrive at quantitative results
we need to specify the function $f(\alpha)$ of 5.0 or $f(\beta, \alpha)$ of 5.01. It might
appear that on q' the new parameter is regarded as unknown and therefore
that we should use the estimation prior probability for it. But this
leads to an immediate difficulty. Suppose that we are considering
whether a location parameter $\alpha$ is 0. The estimation prior probability
for it is uniform, and subject to 5.0 (3) we should have to take $f(\alpha) = 0$,
and K would always be infinite. We must instead say that the mere fact
that it has been suggested that $\alpha$ is zero corresponds to some presumption
that it is fairly small. Then we can make a test with any form
of $f(\alpha)$ whose integral converges. But it must not converge too fast, or
we shall find that the null hypothesis can never be sufficiently decisively
rejected. We shall deal with this explicitly later. At present we need
only remark that the effect of a suggestion that $\alpha = 0$, if it has to be
rejected, implies much less evidence against large values of $\alpha$ than would
be provided by a single observation that would give a maximum likelihood
solution a = 0. In cases where a single observation would not
give strong evidence against large values of $\alpha$, it will be enough to use
the estimation prior probability.

The situation appears to be that when a suggestion arises that calls 
for a significance test there may be very little previous information or 
a great deal. In sampling problems the suggestion that the whole class 
is of one type may arise before any individual at all has been examined. 
In the establishment of Kepler’s laws several alternatives had to be 
discussed and found to disagree wildly with observation before the right 
solutions were found, and by the time when perturbations began to be 
investigated theoretically the extent of departures from Kepler’s laws 
was reasonably well known, and well beyond the standard error of one 
observation. In experimental physics it usually seems to be expected 
that there will be systematic error comparable with the standard error 
of one observation. In much modern astronomical work effects are
deliberately sought when previous information has shown that they 
may be of the order of a tenth of the standard error of one observation, 
and consequently there is no hope of getting a decision one way or the 



other until some hundreds of observations have been taken. In any of
these cases it would be perfectly possible to give a form of $f(\alpha)$ that
would express the previous information satisfactorily, and consideration
of the general argument of 5.0 will show that it would lead to
common-sense results, but they would differ in scale. As we are aiming
chiefly at a theory that can be used in the early stages of a subject, we
shall not at present consider the last type of case; we shall see that the
first two are covered by taking $f(\alpha)$ to be of the form

5.03. Comparison of two sets of observations. Let two sets of
observations, of numbers $n_1$, $n_2$, be derived from laws that agree in
parameters $\alpha_1, \ldots, \alpha_m$, but possibly differ in a parameter $\alpha_{m+1}$. Let the values
of $\alpha_{m+1}$ in the two be $\beta_1$, $\beta_2$. The standard error of $\beta_1 - \beta_2$ as found in
an estimation problem would be



Then the first factor in 5.0 (10) will be 



Now if $n_1$ is very large compared with $n_2$ we are practically comparing
the estimate of $\beta_2$ with an accurately determined value, and (2) should
be $O(n_2^{1/2})$. It is, provided $f(0)$ is independent of $n_1$, and by symmetry
of $n_2$.

This principle is not satisfied by two of the tests given in the first
edition of this book: comparison of two series of measures when the
standard errors are equal (5.51); and comparison of two standard errors
(5.53). In these the factor in question was $O(n_1+n_2)^{1/2}$. The prior
probability of the difference of the parameters on the alternative
hypothesis in these can be seen on examination to depend on $n_1/n_2$. The
method was based on somewhat artificial partitions of expectations.

5.04. Selection of alternative hypotheses. So far we have con- 
sidered the comparison of the null hypothesis with a simple alternative, 
which could be considered as likely as the null hypothesis. Sometimes, 
however, the use of $\chi^2$ or z, or some previous consideration, suggests
that some one of a group of alternative hypotheses may be right with- 
out giving any clear indication of which. For instance, the chief periods 
in the tides and the motion of the moon were detected by first noticing 
that the observed quantity varied systematically and then examining
the departures in detail. In such a case (we are supposing for a moment 
that we are in a pre-Newtonian position without a gravitational theory 




to guide us) the presence of one period by itself would give little or no
reason to expect another. We may say that the presence of various
possible periods gives alternative hypotheses $q_1, q_2, \ldots$, whose
disjunction is $q'$. They are mutually irrelevant, and therefore not
exclusive. Suppose then that the alternatives are $m$ in number, all with
probability $k$ initially, and that

    $P(q \mid H) = P(q' \mid H) = \tfrac{1}{2}.$    (1)

Since we are taking the various alternatives as irrelevant the probability
that they are all false is $(1-k)^m$. But the proposition that they
are all false is $q$; hence

    $(1-k)^m = \tfrac{1}{2},$    (2)

    $k = 1 - 2^{-1/m} = \frac{1}{m}\log 2,$    (3)

if $m$ is large. Thus, if we test the hypothesis $q_1$ separately we shall have

    $\frac{P(q \mid H)}{P(q_1 \mid H)} = \frac{1}{2k} = \frac{m}{2\log 2} = 0.7m,$    (4)

nearly. If $K$ is found by taking $P(q \mid H) = P(q_1 \mid H)$, we can correct
for selection by multiplying $K$ by $0.7m$.
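
The selection factor can be checked numerically. The following is a minimal sketch, assuming the relation $(1-k)^m = \tfrac12$ for $m$ mutually irrelevant alternatives; the function names are illustrative and do not come from the text.

```python
import math

def prior_per_alternative(m: int) -> float:
    """Prior probability k of each alternative, from (1 - k)**m = 1/2."""
    return 1.0 - 2.0 ** (-1.0 / m)

def selection_factor(m: int) -> float:
    """Exact factor 1 / (2k) by which K would be multiplied."""
    return 1.0 / (2.0 * prior_per_alternative(m))

for m in (1, 5, 20, 100):
    exact = selection_factor(m)
    large_m = m / (2.0 * math.log(2.0))  # large-m form, roughly 0.7 m
    print(f"m={m:4d}  exact={exact:8.3f}  m/(2 log 2)={large_m:8.3f}")
```

For a single alternative the factor is 1, as it should be, and for large $m$ the exact value stays close to $m/(2\log 2)$.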

Where the data are frequencies or the values of a continuous quantity 
at a set of discrete values of the argument, a finite number of Fourier 
amplitudes suffice to express the whole of the data exactly, and the 
procedure would be to test these in order, preferably beginning with 
the largest. An intermediate real period would contribute to more than
one estimated amplitude, and the true period could then be estimated
by comparison of adjacent amplitudes.†

Where the dependent variable is a continuous function and we have 
a continuous record of it, neighbouring values are correlated in any 
circumstances. It would be wrong to treat neighbouring values as sub- 
ject to independent errors. The null hypothesis would be more like a 
statement that a finite number of values are assigned at random and 
that the intermediate ones are represented by the interpolation func- 
tion. The problem is a case of what is now known as serial correlation. 
A method that is often used is to divide the interval into several, do 
separate analyses for each, and estimate an uncertainty by comparison. 

In practice it is rather unusual for a set of parameters to arise in
such a way that each can be treated as irrelevant to the presence of
any other. Even in the above case each period means two new parameters,
representing the coefficients of a sine and cosine; the presence
of a period also would usually suggest the presence of its higher
harmonics. More usual cases are where one new parameter gives inductive
reason, but not demonstrative reason, for expecting another, and where
some parameters are so closely associated that one could hardly occur
without the others.

† This method differs appreciably from the 'periodogram' method of Schuster, which
may miss some periods altogether and estimate amplitudes of others that lie too close
together to be independent. It is essentially due to H. H. Turner. For details see H.
and B. S. Jeffreys, Methods of Mathematical Physics, pp. 400, 421.

The former case is common in the discussion of estimates of a physical
constant from different sets of data, to see whether there are any
systematic differences between them. The absence of such differences
can be taken as the null hypothesis. But if one set is subject to
systematic error, that gives some reason to expect that others are too.
The problem of estimating the numbers of normal and abnormal sets
is essentially one of sampling, with half the prior probability
concentrated at one extreme; but we also want to say, as far as possible, which
are the abnormal sets. The problem is therefore to draw the line, and
since $K$ depends chiefly on $\chi^2$ it is convenient to test the sets in turn
in order of decreasing contributions to $\chi^2$. If at any stage we are testing
the $p$th largest contribution ($p > 1$), $p-1$ have already been found
abnormal. Suppose that $s$ have been found normal. Then at this stage
both extreme possibilities have been excluded and the ratio of the prior
probabilities that the $p$th largest contribution is normal or abnormal is
$(s+1)/p$, by Laplace's theory. In practice, if there are $m$ sets, $s$ can be
replaced by $m-p$; for if the $p$th is the smallest abnormal contribution,
$s$ will be equal to $m-p$, so that the line will be drawn in the right place.
Hence $K$ as found in a simple test must be multiplied by $(m-p+1)/p$.
We can then begin by testing the extreme departure, taking $p = 1$,
$s = m-1$, and therefore multiplying $K$ by $m$. If the corrected $K$ is less
than 1 we can proceed to the second, multiplying this time by $(m-1)/2$,
and so on. There is a complication, however, if the first passes the test
and the second does not. For the multiplication by $m$ supposes both
extreme cases excluded already. In testing the first we have not yet
excluded $q$, and if we find no other abnormal cases the question will
arise whether we have not after all decided wrongly that the first was
abnormal. This can be treated as follows. The factor $m$ arises from
Laplace's theory, which makes the prior probabilities of $q$ (no abnormal
cases) and $q'$ (at least one abnormal case) in the ratio 1 to $m$. At the
outset, however, we are taking these probabilities equal, and therefore
we should multiply $K$ by $m^2$ instead of $m$. We can start with $m$; but
if the second departure tested does not give a corrected $K$ less than 1
we should return to the first and apply a factor $m^2$ instead of $m$. It is
best to proceed in this order, because to apply the factor $m^2$ at the first
step might result in the acceptance of $q$ at once and prevent any use
from being made of the second largest contribution to $\chi^2$, which might
be nearly as large as the first.

In comparison with the case where the suggested abnormalities are
irrelevant, the correcting factors to $K$ here are somewhat larger for
testing the largest contributions to $\chi^2$, and smaller for the smaller ones.
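
The comparison of the two kinds of correcting factor can be tabulated. This is a minimal sketch, assuming the ordered-test factor $(m-p+1)/p$ and the irrelevant-alternatives factor $m/(2\log 2)$ discussed above; the function names and the choice $m = 10$ are illustrative only.

```python
import math

def ordered_selection_factor(m: int, p: int) -> float:
    """Factor (m - p + 1) / p for the p-th largest contribution among m sets."""
    if not 1 <= p <= m:
        raise ValueError("p must lie between 1 and m")
    return (m - p + 1) / p

def irrelevant_selection_factor(m: int) -> float:
    """Large-m factor m / (2 log 2) for mutually irrelevant alternatives."""
    return m / (2.0 * math.log(2.0))

m = 10  # hypothetical number of data sets
flat = irrelevant_selection_factor(m)
for p in range(1, m + 1):
    print(f"p={p:2d}  ordered={ordered_selection_factor(m, p):6.2f}  irrelevant={flat:6.2f}")
```

With these assumptions the ordered factor exceeds the flat one only for the first contribution tested and falls below it thereafter.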

The need for such allowances for selection of alternative hypotheses
is serious. If a single hypothesis is set up for test, the critical value
may be such that there would be a probability of 0.05 that it would
be exceeded by accident even if $q$ was true. We have to take such a
risk if we are to have any way of detecting a new parameter when it
is needed. But if we tested twenty new parameters according to the
same rule the probability that the estimate of one would exceed the
critical value by accident would be 0.63. In twenty trials we should
therefore expect to find an estimate giving $K < 1$ even if the null
hypothesis was correct, and the finding of 1 in 20 is no evidence against
it. If we persist in looking for evidence against $q$ we shall always find
it unless we allow for selection. The first quantitative rule for applying
this principle was due, I think, to Sir G. T. Walker;† analogous
recommendations are made by Fisher.‡
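
The size of the selection effect is easy to compute. A minimal sketch, assuming twenty independent tests each with a 0.05 chance of an accidental exceedance (independence is an assumption of this sketch, and the function name is illustrative):

```python
import math

def prob_some_exceedance(alpha: float, trials: int) -> float:
    """Chance that at least one of `trials` independent tests exceeds its
    critical value by accident, each with probability `alpha`."""
    return 1.0 - (1.0 - alpha) ** trials

alpha, trials = 0.05, 20
exact = prob_some_exceedance(alpha, trials)
poisson_like = 1.0 - math.exp(-alpha * trials)  # small-alpha approximation
expected_count = alpha * trials                 # expected accidental exceedances

print(f"exact         = {exact:.4f}")
print(f"1 - exp(-1)   = {poisson_like:.4f}")
print(f"expected hits = {expected_count:.1f}")
```

Both forms give a probability well above one half, and the expected number of accidental exceedances is one.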


5.1. Test of whether a suggested value of a chance is correct. An
answer in finite terms can be obtained in the case where the parameter
in question is a chance, and we wish to know whether the data support
or contradict a value suggested for it. Suppose that the suggested value
is $p$, that the value on $q'$, which is so far unknown, is $p'$, and that our
data consist of a sample of $x$ members of one type and $y$ of the other.
Then on $q'$, $p'$ may have any value from 0 to 1. Thus

    $P(q \mid H) = \tfrac12, \quad P(q' \mid H) = \tfrac12, \quad P(dp' \mid q'H) = dp',$    (1)

whence

    $P(q', dp' \mid H) = \tfrac12\, dp'.$    (2)

Also, if $\theta$ denotes the observational evidence,

    $P(\theta \mid qH) = \binom{x+y}{x} p^x (1-p)^y,$    (3)

    $P(\theta \mid q', p', H) = \binom{x+y}{x} p'^x (1-p')^y,$    (4)

whence

    $P(q \mid \theta H) \propto p^x (1-p)^y,$    (5)

    $P(q', dp' \mid \theta H) \propto p'^x (1-p')^y \, dp',$    (6)

† Q. J. R. Met. Soc. 51, 1926, 337-46.
‡ Statistical Methods for Research Workers, 1936, pp. 66-6.



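and by integration

    $P(q' \mid \theta H) \propto \int_0^1 p'^x (1-p')^y \, dp' = \frac{x!\,y!}{(x+y+1)!}.$    (7)

Hence

    $K = \frac{P(q \mid \theta H)}{P(q' \mid \theta H)} = \frac{(x+y+1)!}{x!\,y!}\, p^x (1-p)^y.$    (8)

If $x$ and $y$ are large, an approximation by Stirling's theorem gives

    $K = \left\{\frac{x+y}{2\pi p(1-p)}\right\}^{1/2} \exp\left[-\frac{\{x-p(x+y)\}^2}{2(x+y)p(1-p)}\right].$    (9)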

The following table indicates how $K$ varies with $x$ and $y$ when these
are small and $p = \tfrac12$; that is, if we are testing whether a chance is even:

     x   y     K          x   y     K          x   y      K
     1   0     1          1   1    3/2         2   2    15/8
     2   0    3/4         2   1    3/2         3   3    35/16
     3   0    1/2         3   1    5/4         4   4   315/128
     4   0    5/16        4   1   15/16        5   5   693/256
     5   0    3/16        5   1   21/32


None of these ratios is very decisive, and a few additional observations
can make an appreciable change. The most decisive is for $x = 5$,
$y = 0$, and even for that the odds in favour of a bias are only those in
favour of picking a white ball at random out of a box containing sixteen
white ones and three black ones: odds that would interest a gambler,
but would be hardly worth more than a passing mention in a scientific
paper. We cannot get decisive results one way or the other from a small
sample.
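
The small-sample values can be reproduced exactly with rational arithmetic. This is a minimal sketch, assuming the form $K = \frac{(x+y+1)!}{x!\,y!}p^x(1-p)^y$ for the test of a suggested chance; the function name is illustrative.

```python
from fractions import Fraction
from math import factorial

def bayes_factor(x: int, y: int, p: Fraction = Fraction(1, 2)) -> Fraction:
    """K = (x+y+1)! / (x! y!) * p**x * (1-p)**y, evaluated exactly."""
    coeff = Fraction(factorial(x + y + 1), factorial(x) * factorial(y))
    return coeff * p ** x * (1 - p) ** y

# A few (x, y) pairs for an even chance, p = 1/2.
for x, y in [(1, 0), (5, 0), (1, 1), (4, 1), (2, 2), (4, 4)]:
    print(f"x={x} y={y}  K={bayes_factor(x, y)}")
```

For $x = 5$, $y = 0$ this gives $K = 3/16$, which corresponds to the odds of sixteen to three mentioned above.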

The result $K = 1$ for $x = 1$, $y = 0$ is interesting. The first member
sampled is bound to be of one type or the other, whether the chance
is $\tfrac12$ or not, and therefore we should expect it to give no information
about the existence of bias. This is checked by the result $K = 1$ for
this case. Similarly, if $x = y$, we have

    $K = \frac{(2x+1)!}{x!\,x!}\, 2^{-2x},$

and if $y$ is increased to $x+1$

    $K = \frac{(2x+2)!}{x!\,(x+1)!}\, 2^{-2x-1},$






which is the same. Thus if at a certain stage the sample is half and 
half, the next member, which is bound to be of one type or the other, 
gives no new information. 

This holds only if $p = \tfrac12$. If $p = \tfrac13$, $x = 1$, $y = 0$, we get $K = \tfrac23$;
but if $x = 0$, $y = 1$ we get $K = \tfrac43$. This is because, if the less likely
event on $q$ comes off at the first trial, it is some evidence against $q$,
and if the likely one comes off it is evidence for $q$. This is reasonable.

For $p = \tfrac12$, $K$ first becomes $< 0.1$ for $x = 7$, $y = 0$, and first becomes
$> 10$ for $x = y = 80$. To get this amount of support for an even chance
requires as much evidence as would fix a ratio found by sampling within
a standard error of $(\tfrac14 \cdot \tfrac{1}{160})^{1/2} = 0.04$. It is therefore possible to obtain
strong evidence against $q$ with far fewer observations than would be
needed to give equally strong evidence for it. This is a general result and
corresponds to the fact that while the first factor in 5.1 (9) increases only
like $n^{1/2}$, the second factor, for a given value of $p'$, will decrease like
$\exp[-\alpha n(p'-p)^2]$, where $\alpha$ is a moderate constant. We notice too that
the expectations of $x$ and $y$ on $q$ are $(x+y)p$ and $(x+y)(1-p)$; so that

    $\chi^2 = \frac{\{x-(x+y)p\}^2}{(x+y)p} + \frac{\{y-(x+y)(1-p)\}^2}{(x+y)(1-p)} = \frac{\{x-(x+y)p\}^2}{(x+y)p(1-p)},$

and the exponential factor is $\exp(-\tfrac12\chi^2)$. This is a general result for
problems where the standard error is fixed merely by the numbers of
observations.
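
The asymmetry between evidence against and for $q$ can be seen numerically. This is a minimal sketch, assuming the exact form $K = \frac{(x+y+1)!}{x!\,y!}p^x(1-p)^y$ and its Stirling approximation with exponent $-\tfrac12\chi^2$; the function names are illustrative, and logarithms of factorials are taken with `math.lgamma` to avoid overflow.

```python
import math

def k_exact(x: int, y: int, p: float) -> float:
    """Exact K, computed through log-factorials."""
    log_k = (math.lgamma(x + y + 2) - math.lgamma(x + 1) - math.lgamma(y + 1)
             + x * math.log(p) + y * math.log(1.0 - p))
    return math.exp(log_k)

def k_stirling(x: int, y: int, p: float) -> float:
    """Large-sample form: sqrt(n / (2 pi p (1-p))) * exp(-chi2 / 2)."""
    n = x + y
    chi2 = (x - n * p) ** 2 / (n * p * (1.0 - p))
    return math.sqrt(n / (2.0 * math.pi * p * (1.0 - p))) * math.exp(-0.5 * chi2)

# Smallest run of one type (y = 0) that brings K below 0.1 for an even chance.
first_below = next(x for x in range(1, 50) if k_exact(x, 0, 0.5) < 0.1)
print("K < 0.1 first reached at x =", first_below)

# A balanced sample of 160 gives only modest support for q.
print("K at x = y = 80:", round(k_exact(80, 80, 0.5), 2),
      "(Stirling:", round(k_stirling(80, 80, 0.5), 2), ")")
```

Seven observations of one type suffice to bring $K$ below 0.1, whereas about 160 evenly divided observations are needed before $K$ is near 10.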

A remarkable series of experiments was carried out by W. F. R.
Weldon† to test the bias of dice. The question here was whether the
chance of a 5 or a 6 was genuinely $\tfrac13$. In 315672 throws, 106602 gave a
5 or a 6. The ratio is 0.337699, suggesting an excess chance of 0.004366.
We find

    $K = \left(\frac{315672}{2\pi \cdot \tfrac13 \cdot \tfrac23}\right)^{1/2} \exp\left[-\frac12\, \frac{315672 \times 0.004366^2}{\tfrac13 \cdot \tfrac23}\right]$

    $\;\; = 476 \exp[-13.539] = 6.27 \times 10^{-4},$

so that the odds are about 1600 to 1 in favour of a small bias. Extreme 
care was taken that a possible bias in the conditions of throwing should 
be eliminated; the dice were actually rolled, twelve at a time, down a 
slope of corrugated cardboard. The explanation appears to be that in 
the manufacture of the dice small pits are made in the faces to accommo- 
date the marking material, and this lightens the faces with 5 or 6 spots, 
displacing the centre of gravity towards the opposite sides and increas- 
ing the chance that these faces will settle upwards. 
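
The arithmetic for the dice data can be repeated directly. A minimal sketch, assuming the large-sample form $K = \{n/(2\pi p(1-p))\}^{1/2}\exp[-\tfrac12 n(\hat p - p)^2/\{p(1-p)\}]$ with $p = \tfrac13$; the variable names are illustrative, and the unrounded excess is used, so the last digits may differ slightly from the rounded working in the text.

```python
import math

throws, successes = 315672, 106602   # throws, and throws showing a 5 or a 6
p = 1.0 / 3.0                        # chance of a 5 or a 6 for unbiased dice

excess = successes / throws - p
prefactor = math.sqrt(throws / (2.0 * math.pi * p * (1.0 - p)))
exponent = 0.5 * throws * excess ** 2 / (p * (1.0 - p))
K = prefactor * math.exp(-exponent)

print(f"ratio     = {successes / throws:.6f}")
print(f"excess    = {excess:.6f}")
print(f"prefactor = {prefactor:.1f}")
print(f"exponent  = {exponent:.3f}")
print(f"K         = {K:.3e}  (odds about {1.0 / K:.0f} to 1 for a bias)")
```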

† Quoted by Pearson, Phil. Mag. 50, 1900.

The formula for testing an even chance is of great use in cases where
observations are given in a definite order, and there is a question
whether they are independent. If we have a set of residuals against an
assigned formula, and they represent only random variation, each is
independent of the preceding ones, and the chances of a persistence and
a change of sign are equal. We can therefore count the persistences and
changes, and compare the numbers with an even chance. If a number
of functions have been determined from the data, each introduces one
change of sign, so that the number of changes should be reduced by the
number of parameters determined. Similarly, if we have a series of
events of two types and they are independent, the same rule will hold.
We may try it on the set of possible results of random sampling given
in 2.13. For the series obtained by coin-tossing we have 7 persistences
and 13 changes, giving nearly

    $K = \left(\frac{40}{\pi}\right)^{1/2} \exp(-0.9) = 1.5.$


This may be accepted as a random series. The second series also
gives 7 persistences and 13 changes and the same value of $K$; but if
we compared each observation with one three places before it we should
have 18 persistences with no change at all. The next two each give 20
persistences and $K = 2 \times 10^{-5}$. The last gives 20 persistences and 6
changes, and $K = \tfrac{1}{11}$ nearly. Thus even with these rather short series
the simple test by counting persistences and changes gives the right
result immediately in four cases out of five, and in the other it would
give it after attention to special types of non-random arrangement,
possibly with allowance for selection. The test, however, does not
necessarily make use of the whole of the information in the data. It is
a convenient and rapid way of detecting large departures, but often
fails for small ones that would be revealed by a method that goes more
into detail.
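
The counting test is simple to mechanize. This is a minimal sketch, assuming the exact even-chance form $K = \frac{(x+y+1)!}{x!\,y!}\,2^{-(x+y)}$ with $x$ persistences and $y$ changes; the function names and the sample sequence are hypothetical and are not the series of 2.13.

```python
import math

def count_persistences_and_changes(series):
    """Count adjacent pairs that repeat (persistences) and that differ (changes)."""
    persistences = sum(1 for a, b in zip(series, series[1:]) if a == b)
    changes = len(series) - 1 - persistences
    return persistences, changes

def k_even_chance(x: int, y: int) -> float:
    """K for testing an even chance with x events of one kind and y of the other."""
    log_k = (math.lgamma(x + y + 2) - math.lgamma(x + 1) - math.lgamma(y + 1)
             - (x + y) * math.log(2.0))
    return math.exp(log_k)

sample = "HTHHTHTTHTHHTHTHTTHTH"   # hypothetical two-valued series, 21 members
x, y = count_persistences_and_changes(sample)
print("persistences:", x, "changes:", y, "K =", round(k_even_chance(x, y), 3))

# The two counts quoted in the text for series of 20 adjacent pairs.
print("7 and 13:", round(k_even_chance(7, 13), 2))
print("20 and 0:", f"{k_even_chance(20, 0):.1e}")
```

No correction for parameters fitted to the data is made here; if residuals from a fitted formula were being tested, the number of changes would first be reduced as described above.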

5.11. Simple contingency. Suppose that a large population is
sampled with respect to two properties $\phi$ and $\psi$. There are four
alternative combinations of properties. The probability of a member having
any pair may be a chance, or the population may be large enough for it
to be considered as one. Then the alternatives, the sampling numbers,
and the chances may be shown as follows:

    $\begin{pmatrix} \phi.\psi & \phi.{\sim}\psi \\ {\sim}\phi.\psi & {\sim}\phi.{\sim}\psi \end{pmatrix}, \quad \begin{pmatrix} x & y \\ x' & y' \end{pmatrix}, \quad \begin{pmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{pmatrix}.$

The question is, are $\phi$ and $\psi$ associated? that is, are the chances out of
proportion? If they are in proportion we have hypothesis $q$, that

    $p_{11}\,p_{22} = p_{12}\,p_{21}.$    (1)

Whether they are in proportion or not, we can consider the chance of
a member having the property $\phi$; let this be $\alpha$, and the chance of $\psi$;
let this be $\beta$. Putting

    $1-\alpha = \alpha', \quad 1-\beta = \beta',$    (2)

we have on $q$

    $\begin{pmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{pmatrix} = \begin{pmatrix} \alpha\beta & \alpha\beta' \\ \alpha'\beta & \alpha'\beta' \end{pmatrix}.$    (3)

On $q'$, since $\alpha$ and $\beta$ are already defined and their amounts have nothing
to do with whether $\phi$ and $\psi$ are associated, the chances can differ only
in such a way that the row and column totals are unaltered; hence there
is a number $\gamma$ such that the set of chances is

    $\begin{pmatrix} \alpha\beta+\gamma & \alpha\beta'-\gamma \\ \alpha'\beta-\gamma & \alpha'\beta'+\gamma \end{pmatrix},$    (4)

and this is general, since any set of chances, subject to their sum being
1, can be represented by a suitable choice of $\alpha$, $\beta$, $\gamma$. Since $\alpha$ and $\beta$ are
chances, zero and unit values excluded, we have

    $P(q\,d\alpha\,d\beta \mid H) = P(q'\,d\alpha\,d\beta \mid H) = \tfrac12\, d\alpha\, d\beta.$    (5)

Also

    $p_{11}\,p_{22} - p_{12}\,p_{21} = \gamma,$    (6)

    $\frac{\partial(p_{11}, p_{12}, p_{21})}{\partial(\alpha, \beta, \gamma)} = \begin{vmatrix} \beta & \beta' & -\beta \\ \alpha & -\alpha & \alpha' \\ 1 & -1 & -1 \end{vmatrix} = 1.$    (7)

Since $\gamma$ is linearly related to the separate chances it is natural to take
its prior probability as uniformly distributed. But $\alpha$ and $\beta$ impose
limits on the possible values of $\gamma$. With a mere rearrangement of the
table we can make $\alpha \le \alpha'$, $\beta \le \beta'$, and

    $\alpha \le \beta;$    (8)

this makes $\alpha$ the smallest of $\alpha$, $\beta$, $\alpha'$, $\beta'$. Then the possible values of $\gamma$
lie between $-\alpha\beta$ and $\alpha\beta'$, since no chance can be negative; and

    $P(d\gamma \mid q', \alpha, \beta, H) = d\gamma/\alpha.$    (9)

Hence

    $P(q'\,d\alpha\,d\beta\,d\gamma \mid H) = \tfrac12\, d\alpha\, d\beta\, d\gamma/\alpha.$    (10)

In ranges where $\alpha$ is not the smallest of $\alpha$, $\beta$, $\alpha'$, $\beta'$, it must be replaced
in the denominator by the smallest.

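Now the chances of getting the observed results in their actual order
are in each case $p_{11}^{x}\, p_{12}^{y}\, p_{21}^{x'}\, p_{22}^{y'}$. Hence

    $P(q\,d\alpha\,d\beta \mid \theta H) \propto \alpha^{x+y}\,\alpha'^{\,x'+y'}\,\beta^{x+x'}\,\beta'^{\,y+y'}\, d\alpha\, d\beta,$    (11)

    $P(q'\,d\alpha\,d\beta\,d\gamma \mid \theta H) \propto (\alpha\beta+\gamma)^{x}(\alpha\beta'-\gamma)^{y}(\alpha'\beta-\gamma)^{x'}(\alpha'\beta'+\gamma)^{y'}\, d\alpha\, d\beta\, d\gamma/\alpha.$    (12)

Integrating the former we get

    $P(q \mid \theta H) \propto \frac{(x+y)!\,(x'+y')!\,(x+x')!\,(y+y')!}{\{(x+x'+y+y'+1)!\}^2}.$    (13)

We have

    $\frac{\partial(p_{11}, p_{21})}{\partial(\beta, \gamma)} = -1,$    (14)

and the integral of (12) is nearly

    $P(q' \mid \theta H) \propto \int_0^1\!\!\int_0^{\alpha}\!\!\int_0^{1-\alpha} p_{11}^{x}(\alpha-p_{11})^{y}\, p_{21}^{x'}(1-\alpha-p_{21})^{y'}\, \frac{d\alpha\, dp_{11}\, dp_{21}}{\alpha}$

    $\quad = \int_0^1 \frac{x!\,y!}{(x+y+1)!}\,\alpha^{x+y} \cdot \frac{x'!\,y'!}{(x'+y'+1)!}\,(1-\alpha)^{x'+y'+1}\, d\alpha$

    $\quad = \frac{x!\,y!\,x'!\,y'!}{(x+y+1)\,(x+y+x'+y'+2)!}.$    (15)

Hence

    $K = \frac{(x+y+1)!\,(x'+y')!\,(x+x')!\,(y+y')!}{x!\,y!\,x'!\,y'!\,(x+y+x'+y'+1)!}\,(x+y+x'+y'+2).$    (16)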


An approximation has been made in allowing $\alpha$ to range from 0 to 1,
since $\alpha < \beta < \tfrac12$; but if $x+y$ is the smallest total, $\alpha$ is about

    $\frac{x+y}{x'+y'+x+y},$

and the contribution from the extra range is exponentially small unless
$\alpha$ and $\beta$ are nearly equal. The exact procedure would be to replace $\alpha$
by $\beta$, $\alpha'$, or $\beta'$ in ranges where $\alpha$ is not the smallest; thus we have very
slightly underestimated $P(q' \mid \theta H)$ and overestimated $K$.

If $x'$ and $y'$ are very large compared with $x$ and $y$, the chance of $\psi$,
given ${\sim}\phi$, is very accurately given by $x'/(x'+y')$. Replacing this by
$p$ we have

    $K = \frac{(x+y+1)!}{x!\,y!}\, \frac{x'^{\,x}\, y'^{\,y}}{(x'+y')^{x+y}} = \frac{(x+y+1)!}{x!\,y!}\, p^{x}(1-p)^{y},$    (17)

which is the same as 5.1 (8). This was to be expected. The present form
is a little more accurate than a previous one,† in which I integrated
without reference to the variation of $p_{11}+p_{12}$, replacing the latter after
integration by its most probable value. The result was that the extra
factor $x+y+1$ was replaced by $x+y$. The difference is trivial, but will
give an idea of the amount of error introduced by the procedure of
integrating the factors with large indices and replacing those with small
indices by their most probable values at the end of the work.

† Proc. Roy. Soc. A, 162, 1937, 479-96.

If $x$, $y$, $x'$, $y'$ are all large we can approximate by Stirling's formula;
then

    $K = \left\{\frac{(x+y+x'+y')^3\,(x+y)}{2\pi\,(x+x')(x'+y')(y+y')}\right\}^{1/2} \exp\left\{-\frac12\, \frac{(x+y+x'+y')(xy'-x'y)^2}{(x+y)(x+x')(x'+y')(y+y')}\right\},$    (18)



where $x+y$ is defined to be the smallest of the four row and column
sums. The exponential factor is $\exp(-\tfrac12\chi^2)$. For if we put

    $N = x+y+x'+y',$

the four expectations on $q$, given the row and column totals, are

    $\frac{(x+y)(x+x')}{N}, \quad \frac{(x+y)(y+y')}{N}, \quad \frac{(x+x')(x'+y')}{N}, \quad \frac{(x'+y')(y+y')}{N},$

and

    $x - \frac{(x+y)(x+x')}{N} = \frac{xy'-x'y}{N},$

the other residuals being equal or equal and opposite to this. Hence

    $\chi^2 = \frac{(xy'-x'y)^2}{N^2}\left\{\frac{N}{(x+y)(x+x')} + \frac{N}{(x+y)(y+y')} + \frac{N}{(x+x')(x'+y')} + \frac{N}{(x'+y')(y+y')}\right\}$

    $\quad\;\; = \frac{N(xy'-x'y)^2}{(x+y)(x+x')(y+y')(x'+y')}.$    (19)
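
The exact and approximate forms for a 2x2 table can be compared numerically. This is a minimal sketch, assuming the exact expression 5.11 (16), the Stirling form 5.11 (18), and the $\chi^2$ of (19), with $x+y$ taken as the smallest marginal total; the function names and the example table are hypothetical.

```python
import math

def log_factorial(n: int) -> float:
    return math.lgamma(n + 1)

def chi_squared(x: int, y: int, xp: int, yp: int) -> float:
    """chi^2 = N (x y' - x' y)^2 / (product of the four marginal totals)."""
    n = x + y + xp + yp
    return n * (x * yp - xp * y) ** 2 / ((x + y) * (x + xp) * (xp + yp) * (y + yp))

def k_exact(x: int, y: int, xp: int, yp: int) -> float:
    """Exact K in the form of 5.11 (16); x + y should be the smallest total."""
    n = x + y + xp + yp
    log_k = (log_factorial(x + y + 1) + log_factorial(xp + yp)
             + log_factorial(x + xp) + log_factorial(y + yp)
             - log_factorial(x) - log_factorial(y)
             - log_factorial(xp) - log_factorial(yp)
             - log_factorial(n + 1) + math.log(n + 2))
    return math.exp(log_k)

def k_stirling(x: int, y: int, xp: int, yp: int) -> float:
    """Large-sample K in the form of 5.11 (18)."""
    n = x + y + xp + yp
    prefactor = math.sqrt(n ** 3 * (x + y)
                          / (2.0 * math.pi * (x + xp) * (xp + yp) * (y + yp)))
    return prefactor * math.exp(-0.5 * chi_squared(x, y, xp, yp))

table = (30, 10, 40, 60)   # hypothetical counts x, y, x', y'
print("chi^2      =", round(chi_squared(*table), 3))
print("K exact    =", k_exact(*table))
print("K Stirling =", k_stirling(*table))
```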


5.12. Comparison of samples. In the last problem the only
restriction on the sample is its total number $N$. If $\phi$ is a rare property,
we may require a prohibitively large sample to make $x$ and $y$ large
enough to give a decisive test one way or the other. But it may be
possible to arrange the sampling so that $x+y$ and $x'+y'$ are both large
enough to be useful, without violating the condition that, given either
$\phi$ or ${\sim}\phi$, a member has the same chance of being included in the
sample whether it has $\psi$ or ${\sim}\psi$. Thus, if we want to know whether
red hair is more frequent among Englishmen or Scotsmen, we might
take a sample at random from the population of London, and classify
the results in a 2x2 contingency table. But if such a sample is to
contain enough Scotsmen to give much information it will contain more
Englishmen than it is practicable to classify. We can, however, proceed
in two other ways. We can sample at random till we have, say, 200
Englishmen, and after that we can ignore further Englishmen and count
Scotsmen only, until we have a suitable number of the latter. Or we
can take a random sample of 200 Englishmen from London, and another
of 200 Scotsmen from Perth, and compare the two samples. If $\phi$ is the
property 'Scottish' and ${\sim}\phi$ 'English', these methods do not attempt
to provide information about $\alpha$, but replace it by two sample totals
$x+y$ and $x'+y'$ determined for convenience.



On hypothesis $q$ the chance of $\psi$ is the same, given either $\phi.H$ or
${\sim}\phi.H$. Call this $\beta$. Then

    $P(d\beta \mid qH) = d\beta,$    (1)

    $P(\theta \mid q, \beta, H) = \beta^{x+x'}(1-\beta)^{y+y'},$    (2)

and $(x+y+x'+y')\beta$ is the expectation of $\psi$'s in a sample of $x+y+x'+y'$
in all. To have a valid standard of comparison, if $p$ and $p'$ are the
chances of $\psi$ on $q'.\phi.H$ and on $q'.{\sim}\phi.H$, we must define a $\beta$ by

    $N\beta = (x+y+x'+y')\beta = (x+y)p+(x'+y')p',$    (3)

so that the left side will still be the expectation of the number of $\psi$'s
in the two samples together. $\beta$ has the property that it is orthogonal
to $p-p'$. Then both $p$ and $p'$ must be between 0 and 1. Within the
permitted range for $p$, $p'$ for a given $\beta$ can have values from $N\beta/(x'+y')$
to $(N\beta-x-y)/(x'+y')$. But the most probable value of $N\beta$ will be
nearly $x+x'$. The former value will then be permissible if $x < y'$, and
the latter if $x' > y$, and if these are satisfied there will be no further
restriction. Then

    $P(d\beta \mid q', H) = d\beta,$    (4)

    $P(dp \mid q', \beta, H) = dp,$    (5)

    $\frac{\partial(\beta, p)}{\partial(p', p)} = \frac{x'+y'}{N}.$    (6)

Hence

    $P(\theta \mid q', p, p', H) = p^{x}(1-p)^{y}\,p'^{\,x'}(1-p')^{y'},$    (7)

    $P(q \mid \theta H) \propto \int_0^1 \beta^{x+x'}(1-\beta)^{y+y'}\, d\beta = \frac{(x+x')!\,(y+y')!}{(x+x'+y+y'+1)!},$    (8)

    $P(q' \mid \theta H) \propto \iint p^{x}(1-p)^{y}\,p'^{\,x'}(1-p')^{y'}\, d\beta\, dp = \frac{x'+y'}{N} \iint p^{x}(1-p)^{y}\,p'^{\,x'}(1-p')^{y'}\, dp\, dp'$

    $\quad = \frac{x'+y'}{N}\, \frac{x!\,y!\,x'!\,y'!}{(x+y+1)!\,(x'+y'+1)!},$    (9)

    $K = \frac{(x+y+1)!\,(x+x')!\,(y+y')!\,(x'+y'+1)!\,N}{x!\,y!\,x'!\,y'!\,(N+1)!\,(x'+y')},$    (10)

which differs from 5.11 (16) only by quantities of order $1/(x'+y')$.

5.13. If $x' < y$ (more strictly, if $(x'+y')p' < (x+y)(1-p)$) the
possible values of $p$ impose a further restriction, since the largest possible
value of $p$ is now $N\beta/(x+y)$. Then (4) and (6) still hold, but

    $P(dp \mid \beta, q', H) = \frac{x+y}{N\beta}\, dp,$    (1)

and we are led to

    $P(q' \mid \theta H) \propto \frac{(x+y)(x'+y')}{N^2} \iint \frac{1}{\beta}\, p^{x}(1-p)^{y}\,p'^{\,x'}(1-p')^{y'}\, dp\, dp',$    (2)

and at the maximum of the integrand $\beta = (x+x')/N$ nearly. Hence

    $K = \frac{(x+y)!\,(x+x'+1)!\,(y+y')!\,(x'+y')!}{x!\,y!\,x'!\,y'!\,(x+y+x'+y')!}$    (3)

nearly; and with errors of order $1/(x+x')$ this is the same as we get
by interchanging $x+x'$ with $x+y$ in 5.11 (16) and 5.12 (10) according
to the altered sign of their difference.

5.14. The actual agreement is rather closer, as we can see by studying
the case where $\beta$ is very small. In this case we may be led to the
Poisson rule, and to the rule $P(d\beta \mid H) \propto d\beta/\beta$ instead of the usual
uniform one. But the discrepancy in the results, such as it is, consists of
a replacement of $x+x'+1$ by $x+x'$, and this, if genuine and not merely
an error of approximation, should persist when $y$ and $y'$ are very large.
The range permitted to $p$ will still be restricted by $\beta$, but it is best to
insert a function $f(\beta)$ to generalize the prior probability of $\beta$. Then we
shall have

    $P(q \mid \theta H) \propto \int_0^1 f(\beta)\,\beta^{x+x'}(1-\beta)^{y+y'}\, d\beta,$    (1)

    $P(q' \mid \theta H) \propto \int_0^1\!\!\int f(\beta)\,\frac{x+y}{N\beta}\, p^{x}(1-p)^{y}\,p'^{\,x'}(1-p')^{y'}\, d\beta\, dp.$    (2)

If $y$ and $y'$ are large and $p$ and $p'$ small, these reduce to

    $P(q \mid \theta H) \propto \int_0^1 f(\beta)\,\beta^{x+x'}\exp\{-\beta(y+y')\}\, d\beta,$    (3)

    $P(q' \mid \theta H) \propto \int_0^1\!\!\int f(\beta)\,\frac{y}{(y+y')\beta}\, p^{x}\,p'^{\,x'}\exp\{-py-p'y'\}\, d\beta\, dp,$    (4)

and we have

    $(y+y')\beta = py+p'y',$    (5)

so that we can put

    $yp = (y+y')\beta\eta, \quad y'p' = (y+y')\beta(1-\eta),$    (6)

where the permitted range of $\eta$ is from 0 to 1. Then

    $P(q' \mid \theta H) \propto \int_0^1\!\!\int_0^1 f(\beta)\,\beta^{x+x'}\exp\{-\beta(y+y')\}\,\eta^{x}(1-\eta)^{x'}\,\frac{(y+y')^{x+x'}}{y^{x}\,y'^{\,x'}}\, d\beta\, d\eta,$    (7)

and the integrals involving $\beta$ in (3) and (7) are identical whatever the
form of $f(\beta)$. Hence $f(\beta)$ gives only an irrelevant factor, and

    $\frac{1}{K} = \frac{(y+y')^{x+x'}}{y^{x}\,y'^{\,x'}} \int_0^1 \eta^{x}(1-\eta)^{x'}\, d\eta,$    (8)

    $K = \frac{(x+x'+1)!}{x!\,x'!}\, \frac{y^{x}\,y'^{\,x'}}{(y+y')^{x+x'}},$    (9)

which is correct to $O(y^{-1}, y'^{-1})$ and is valid subject to the conditions
that the Poisson law may be substituted for the binomial. Also it is
identical in form with 5.1 (8); thus the agreement of two small estimated
chances $x/(x+y)$ and $x'/(x'+y')$ can be tested by the same formula as the
agreement of a chance $x/(x+x')$ with a predicted one $y/(y+y')$. Thus the
difference noted in 5.12 is only an error of approximation. It follows
that however the sample may be taken, the proportionality of the
chances can be tested by

    $K = \frac{(x+y+1)!\,(x+x')!\,(y+y')!\,(x'+y')!}{x!\,y!\,x'!\,y'!\,(x+x'+y+y')!},$    (10)

where $x+y$ means the smallest of the four totals; and the error is
always of order $K/(x'+y')$.

Fisher† quotes from Lange the following data on the convictions of
twin brothers or sisters (of like sex) of convicted criminals, according
as the twins were monozygotic (identical) or dizygotic (no more alike
physically than ordinary brothers or sisters). The contingency table,
arranged to satisfy the necessary inequalities, is as follows:

                         Monozygotic    Dizygotic
    Convicted                10             2
    Not convicted             3            15

Then

    $K = \frac{13!\,13!\,17!\,18!}{10!\,2!\,3!\,15!\,30!} = \frac{1}{171},$

while the less accurate exponential approximation gives about $\tfrac{1}{190}$. Thus the
latter, even though Stirling's formula and logarithmic approximation
have been applied down to 2! and 3!, is still quite reasonably accurate.
What we can infer is that, starting without information about whether
there is any difference in criminality between similar and dissimilar
twins of criminals, we can assert on the data that the odds on the
existence of a difference are about 170 to 1.

† Statistical Methods for Research Workers, 1936, p. 99.

Yule and Kendall† quote the following official data on the results of
inoculation of cattle with the Spahlinger anti-tuberculosis vaccine. The
cattle were deliberately infected with tubercle germs, a set of them
having first been inoculated. The table, rearranged, is:

                         Died or seriously    Not seriously
                             affected            affected
    Not inoculated              8                   3
    Inoculated                  6                  13

    $K = \frac{12!\,14!\,16!\,19!}{8!\,3!\,6!\,13!\,30!} = 0.37,$

the exponential approximation 5.11 (18) giving 0.31. The odds are
about 3 to 1 that inoculation has a preventive effect.
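
Both worked examples can be recomputed exactly. This is a minimal sketch, assuming the final formula of 5.14, which amounts to the hypergeometric probability of the table multiplied by one more than the smallest marginal total; the function name is illustrative, and the cell counts are those of the two tables as read here (twins: 10, 2, 3, 15; cattle: 8, 3, 6, 13).

```python
from fractions import Fraction
from math import factorial

def k_proportionality(a: int, b: int, c: int, d: int) -> Fraction:
    """Exact K for the 2x2 table [[a, b], [c, d]].

    Equals (m + 1) * r1! r2! c1! c2! / (a! b! c! d! N!), where m is the
    smallest of the four marginal totals, so the table need not be
    rearranged by hand before calling.
    """
    margins = (a + b, c + d, a + c, b + d)
    numerator = 1
    for total in margins:
        numerator *= factorial(total)
    denominator = factorial(a + b + c + d)
    for cell in (a, b, c, d):
        denominator *= factorial(cell)
    return (min(margins) + 1) * Fraction(numerator, denominator)

twins = k_proportionality(10, 2, 3, 15)
cattle = k_proportionality(8, 3, 6, 13)
print("twins : K = 1 /", round(float(1 / twins), 1))
print("cattle: K =", round(float(cattle), 3))
```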

Tables of factorials are given in Comrie's edition of Barlow's tables;
of their logarithms, up to $n = 100$, in Milne-Thomson and Comrie,
Standard Four-figure Tables, Table VI.

† Introduction to the Theory of Statistics, 1938, p. 48.

The following comparison was undertaken to see whether there is
any relation between grammatical gender and psychoanalytic symbolism.
The list of symbols in Freud's Introductory Lectures was taken as
a standard, and the corresponding words were taken from Latin, German,
and Welsh dictionaries. All synonyms were included; I considered
consulting experts in the languages for the usual words, and using the
German words from the original edition of the book, but this, I thought,
might introduce a bias, and I preferred in the first place to use the
whole of the synonyms. The counts were as follows:

                   Latin               German              Welsh
               M.    F.    N.      M.    F.    N.       M.    F.
    Male       27    17     4      31    14     7       46    30
    Female     10    37    16      16    20    16       28    29

In the first place we ignore neuters and reduce the matter to three 2x2
tables. The respective values of $\chi^2$ are 16.07, 10.78, and 1.56. Using
the approximate formula 5.11 (18) we get $K = 1/296$, $1/30$, and 3.7 for
Latin, German, and Welsh respectively. The phenomenon is so striking
in the two former that a relation between symbolism and gender in
them must be considered established, though we see that it is far from
being a complete association. It would be more striking still if we
combined all three languages, but many words have been adopted
from one to another or from common sources, keeping their genders,
and the data would not be independent. The association is somewhat
stronger in Latin than in German; this is some evidence against the
possibility that Freud was guided by the gender in German in his
classification.

The non-significant association in Welsh is comprehensible in relation 
to the other two languages when we inspect the neuters, for Welsh is 
a two-gender language like French and the primitive neuters have been 
made masculine. But we notice both in Latin and German a marked
tendency for male symbols to avoid the neuter gender; there is a decided
preference to make them feminine rather than neuter. On the other 
hand, a female symbol is somewhat more likely to be neuter than 
masculine. But when the neuters are made masculine this effect partly 
counteracts the association between symbolism and masculine or 
feminine gender. Thus the failure to detect the association in Welsh 
is not due to the absence of association but to the fact that the greater 
parts of two genuine effects have been made to cancel by an etymo- 
logical rule. 

The German rule that diminutives are neuter may provide part of 
the explanation; the three genders may stand originally for father, 
mother, and child. But this cannot be pursued further here. The im- 
mediate result is that the gender of names of inanimate things is not 
wholly haphazard. 

5.15. Test for consistency of two Poisson parameters. It may
happen that two experiments are such that the Poisson rule should
hold, but that the conditions on $q$ predict a ratio for the two parameters;
the question is whether the data support this ratio. Thus in
either case the joint chance of the numbers of occurrences in the two
series will be

    $P(\theta \mid r, r', H) = \frac{r^{x}\,r'^{\,x'}}{x!\,x'!}\, e^{-(r+r')},$    (1)

but on $q$ we are given

    $r/r' = a/(1-a),$    (2)

and we can introduce $b$ such that

    $r = ab, \quad r' = (1-a)b;$    (3)

while on $q'$

    $r = \alpha b, \quad r' = (1-\alpha)b,$    (4)

and it now appears that $\alpha$ must be between 0 and 1. Then

    $P(q, db \mid H) = f(b)\, db, \quad P(q'\, db\, d\alpha \mid H) = f(b)\, db\, d\alpha,$    (5)

    $P(\theta \mid q, b, H) \propto a^{x}(1-a)^{x'}\, b^{x+x'} e^{-b},$    (6)

    $P(\theta \mid q', b, \alpha, H) \propto \alpha^{x}(1-\alpha)^{x'}\, b^{x+x'} e^{-b},$    (7)

    $P(q\, db \mid \theta H) \propto f(b)\, a^{x}(1-a)^{x'}\, b^{x+x'} e^{-b}\, db,$    (8)

    $P(q'\, db\, d\alpha \mid \theta H) \propto f(b)\, \alpha^{x}(1-\alpha)^{x'}\, b^{x+x'} e^{-b}\, db\, d\alpha.$    (9)

Integration with regard to $b$ gives the same factor in both cases, and

    $\frac{1}{K} = \int_0^1 \alpha^{x}(1-\alpha)^{x'}\, d\alpha \div a^{x}(1-a)^{x'}.$    (10)

Hence

    $K = \frac{(x+x'+1)!}{x!\,x'!}\, a^{x}(1-a)^{x'}.$    (11)

This is the same result as 5.14 (9), but does not depend on the sampling
theory of the Poisson rule. It would have several applications where
this rule arises. In the case of radioactivity, if $n$ is the number of
atoms in a specimen, and the chance that a given atom will break up
in time $dt$ is $\lambda\, dt$, the expectation of the number in time $t$ is $n\lambda t$. Here
$n$ would be fixed by the mass of the specimen and the atomic weights,
and $t$ by the experimental conditions, while $\lambda$ is to be found. The need
for a significance test would arise if there was a question whether high
pressure, temperature, or cosmic rays affected $\lambda$. The experiments
might not involve the same values of $n$ and $t$, but the expectations, on
hypothesis $q$, that there is no effect would be in the known ratio $nt/n't'$.
The test would therefore be given by

    $K = \frac{(x+x'+1)!}{x!\,x'!}\, \frac{(nt)^{x}(n't')^{x'}}{(nt+n't')^{x+x'}}.$

In the Aitken dust counter, a question might be whether two samples
of air are equally dusty. If the same apparatus is used to test both,
$a = \tfrac12$; if not, $a/(1-a)$ is the ratio of the volumes of the samples taken.

Again, two specimens of rock might be compared to see if they are
equally radioactive, $\alpha$-particle counts being the data. The masses $m$, $m'$
of the specimens and the times $t$, $t'$ of the experiments would not in
general be the same; the expectations of the numbers of disintegrations
on the hypothesis that U and Th constitute the same fractions of the
specimens will be in the ratio $mt/m't'$. This question would seldom arise
in practice, since it is highly exceptional for two rocks to have the
same radioactivity, but it might arise if there was such a question for
two specimens from the same dike.
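
The test for two counts is short enough to compute directly. This is a minimal sketch, assuming the form $K = \frac{(x+x'+1)!}{x!\,x'!}a^{x}(1-a)^{x'}$ with $a$ fixed by the known ratio of exposures; the function names and all the numerical inputs below are hypothetical.

```python
import math

def exposure_share(size: float, time: float, size2: float, time2: float) -> float:
    """a = nt / (nt + n't'), the predicted share of events in the first series."""
    return size * time / (size * time + size2 * time2)

def k_poisson_ratio(x: int, xp: int, a: float) -> float:
    """K = (x + x' + 1)! / (x! x'!) * a**x * (1 - a)**x'."""
    log_k = (math.lgamma(x + xp + 2) - math.lgamma(x + 1) - math.lgamma(xp + 1)
             + x * math.log(a) + xp * math.log(1.0 - a))
    return math.exp(log_k)

# Hypothetical experiment: equal specimens, second one observed three times longer.
a = exposure_share(1.0, 1.0, 1.0, 3.0)
for counts in [(25, 75), (40, 60), (55, 45)]:
    print(counts, "K =", f"{k_poisson_ratio(*counts, a):.3g}")
```

Counts close to the predicted one-to-three split give $K > 1$, supporting $q$, while a large departure drives $K$ far below 1.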


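5.2. Test of whether the true value in the normal law is zero:
standard error originally unknown. If $\sigma$ is the standard error and
$\lambda$ the true value, $\lambda$ is 0 on $q$. We want a suitable form for its prior
probability on $q'$. From considerations of similarity it must depend on
$\sigma$, since there is nothing in the problem except $\sigma$ to give a scale for $\lambda$.
Then we should take

    $P(q\, d\sigma \mid H) \propto \frac{d\sigma}{\sigma},$    (1)

    $P(q'\, d\sigma\, d\lambda \mid H) \propto f\!\left(\frac{\lambda}{\sigma}\right) \frac{d\sigma}{\sigma}\, \frac{d\lambda}{\sigma},$    (2)

where

    $\int_{-\infty}^{\infty} f\!\left(\frac{\lambda}{\sigma}\right) \frac{d\lambda}{\sigma} = 1.$    (3)

If there are $n$ observations

    $P(\theta \mid q, \sigma, H) \propto \sigma^{-n} \exp\left\{-\frac{n}{2\sigma^2}(\bar{x}^2+s'^2)\right\},$    (4)

    $P(\theta \mid q', \sigma, \lambda, H) \propto \sigma^{-n} \exp\left[-\frac{n}{2\sigma^2}\{(\bar{x}-\lambda)^2+s'^2\}\right].$    (5)

Then

    $P(q\, d\sigma \mid \theta H) \propto \sigma^{-n-1} \exp\left\{-\frac{n}{2\sigma^2}(\bar{x}^2+s'^2)\right\} d\sigma,$    (6)

    $P(q'\, d\sigma\, d\lambda \mid \theta H) \propto f\!\left(\frac{\lambda}{\sigma}\right) \sigma^{-n-2} \exp\left[-\frac{n}{2\sigma^2}\{(\bar{x}-\lambda)^2+s'^2\}\right] d\sigma\, d\lambda.$    (7)

We should expect that for $n = 1$ no decision would be reached in the
absence of previous information about $\sigma$ and $\lambda$, since the departure of
a single measure from zero could be interpreted equally well as a random
error or as a departure of $\lambda$ from zero. We should also expect that for
$n \ge 2$, $K$ would be 0 if $s' = 0$, $\bar{x} \ne 0$; for exact agreement of even two
observations would be interpreted as an indication that $\sigma = 0$ and
therefore $\lambda = \bar{x} \ne 0$.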

If a' = 0, X 0, take x positive, and put 


<7 = x/t, X = av = xvjr. 


(8) 


P(q I ^^1) oc J exp(— (9) 
0 


P{q*\0H)oc 


0 


( 10 ) 


Then 



§6.2 SIGNIFICANCE TESTS; ONE NEW PARAMETER 243 

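(9) converges for all n ≥ 1. If n = 1 and f(v) is any even function,

$$P(q' \mid \theta H) \propto \frac{1}{\bar{x}}\int_0^\infty d\tau\int_0^\infty f(v)\left[\exp\{-\tfrac12(v-\tau)^2\}+\exp\{-\tfrac12(v+\tau)^2\}\right]dv$$

$$= \frac{1}{\bar{x}}\int_0^\infty f(v)\,dv\int_{-\infty}^{\infty}\exp\{-\tfrac12(v-\tau)^2\}\,d\tau = \frac{1}{\bar{x}}\sqrt{\tfrac12\pi}. \qquad (11)$$

Also from (9)

$$P(q \mid \theta H) \propto \frac{1}{\bar{x}}\sqrt{\tfrac12\pi}, \qquad (12)$$

and therefore K = 1. Hence the condition that one observation shall
give an indecisive result is satisfied if f(v) is any even function with
integral 1.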

If n ≥ 2, the condition that K = 0 for s' = 0, x̄ ≠ 0 is equivalent
to the condition that (10) shall diverge. For v large and positive

$$\int_0^\infty \tau^{n-1}\exp\{-\tfrac12 n(\tau-v)^2\}\,d\tau \sim A\,v^{n-1}, \qquad (13)$$

where A is a function of n. This integral is bounded for small v. For
v negative it is exponentially small but positive. Hence (10) diverges
if and only if

$$\int^{\infty} f(v)\,v^{n-1}\,dv \qquad (14)$$

diverges. The simplest function satisfying this condition for n > 1 and
also satisfying (3) is

$$f(v) = \frac{1}{\pi(1+v^2)}. \qquad (15)$$

Corresponding to this and (2)

$$P(d\lambda \mid q'\sigma H) = \frac{1}{\pi}\,\frac{d\lambda}{\sigma(1+\lambda^2/\sigma^2)}. \qquad (16)$$


In the first edition of this book I used as a parameter a quantity σ',
which would in the present notation be (σ²+λ²)^{1/2}, and would have the
property that on any set of observations its maximum likelihood
estimate would be the same whether λ is assumed zero or not. Then the
prior probability of λ was taken uniform with respect to σ'; hence

$$P(d\sigma'\,d\lambda \mid q'H) \propto \frac{d\sigma'\,d\lambda}{2\sigma'^2} = \frac{\sigma\,d\sigma\,d\lambda}{2(\sigma^2+\lambda^2)^{3/2}}. \qquad (17)$$

This does not satisfy (14) for n = 2, as was first found in a detailed
numerical investigation, which showed that, for n = 2, K could never
be less than 0.47 however closely the observations agreed.†

It may be remarked that many physicists totally reject the usual 
theory of errors on the ground that systematic errors are always present 
and are not reduced by taking the mean of a large number of observa- 
tions. They would maintain (1) that the mean of a large number of 
observations made in the same way is not necessarily better than one 
observation, and the only use of making more than one observation is 
to check gross mistakes; (2) that the weighted mean of several series 
of observations is worse than the value given by the best series. It has 
been rejected as inconsistent with the theory of probability, but this 
rejection is associated with the belief that the normal law is the only 
law of probability. The belief of the old-fashioned physicist can in fact
be completely formalized. If the law of error for one observation is a
Cauchy law about a constant, then the mean of any number of
observations follows exactly the same law, and his condition (1) is satisfied.
If, irrespective of the random variation within each series, the location
parameter for each set has a departure from the true value with a
probability law given by (16), then the mean of the location parameters
has a probability distribution of the same form with a scale parameter
equal to the mean of the separate σ, and therefore not less than the
smallest σ. Thus condition (2) is also satisfied.
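Both closure properties used in this paragraph follow from the characteristic function of the Cauchy law. A short derivation is sketched here; the symbols φ(u), m and σ̄ are introduced only for this illustration and are not the author's notation.

```latex
% Characteristic function of a Cauchy law with location \lambda and scale \sigma:
\phi(u) = E\,e^{iux} = \exp(i\lambda u - \sigma|u|).

% Mean of n independent observations from the same law:
E\,e^{iu\bar{x}} = \{\phi(u/n)\}^{n} = \exp(i\lambda u - \sigma|u|),
% so the mean follows exactly the law of one observation (condition 1).

% Mean of m independent location parameters with scales \sigma_1,\dots,\sigma_m:
\prod_{r=1}^{m}\exp\!\left(\frac{i\lambda u}{m} - \frac{\sigma_r|u|}{m}\right)
  = \exp\!\left(i\lambda u - \bar{\sigma}|u|\right),
\qquad
\bar{\sigma} = \frac{1}{m}\sum_{r=1}^{m}\sigma_r \;\ge\; \min_r \sigma_r ,
% a law of the same form whose scale is the mean of the separate scales (condition 2).
```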

On the other hand, detailed study of errors of observation usually 
shows that they are far from following the Cauchy law; the normal
law is nearer, and averages fluctuate less than the Cauchy law would 
indicate. Also there are plenty of cases where estimates made by 
different methods have agreed as well as would be expected on the 
hypothesis that the normal law of error holds and that there are no 
systematic errors. The belief of the old-fashioned physicist must in 
fact be regarded as a serious hypothesis, or pair of hypotheses, capable 
of being sufficiently clearly stated to be tested and therefore deserving 
test, according to our rule of 1.1 (5). But actual test shows that they 
are not in general true. We do, however, often find discrepancies. We 
provide for these by taking prior probability ½ for no real difference,
and ½ for a real difference, and distributing the latter over possible
values of the difference in such a way that if it is not zero it can always 
be detected and asserted with confidence given sufficient observations. 
The dependence on the standard error indicated in (16) may be regarded
as an expression of the fact that special care in reducing the random
error will usually be associated with special care in eliminating systematic
errors. The astronomical case is a special one, since random errors have
already been reduced as far as they can for most types of observation,
and progress has long depended mainly on eliminating systematic errors.
We therefore in our rule of procedure reject the Cauchy law for the
random variation about the mean. We use it for systematic differences
except that we allow a non-zero fraction, usually ½ of the total prior
probability, to be concentrated at zero difference.

† Proc. Roy. Soc. A, 180, 1942, 266-88.

The old-fashioned physicist’s view is therefore not nonsensical. It 
consists of two parts, both of which can be clearly stated, but the first 
part is wrong and the second exaggerated. When the second part is 
cleared of exaggeration it leads to a valuable working rule with the 
properties that we require. 

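An asymptotic form is easily found for K, when n is large. In (7) the
large values of the integrand, for given σ, are in a range of order
λ = x̄ ± O(σ/√n). In such a range f(λ/σ) varies little from its value at
λ = x̄. Hence we can perform the integration with regard to λ
approximately:

$$P(q \mid \theta H) \propto \int_0^\infty \sigma^{-n-1}\exp\left\{-\frac{n}{2\sigma^2}(\bar{x}^2+s'^2)\right\}d\sigma, \qquad (18)$$

$$P(q' \mid \theta H) \propto \sqrt{\frac{2\pi}{n}}\int_0^\infty f\!\left(\frac{\bar{x}}{\sigma}\right)\sigma^{-n-1}\exp\left(-\frac{ns'^2}{2\sigma^2}\right)d\sigma. \qquad (19)$$

Again, the integrals are of the same form except for the factor in x̄/σ,
which varies slowly. The large values of the second integrand are near
σ = s'. Substituting this value in the slowly varying factor and
suppressing a factor that is the same for both integrals we have

$$P(q \mid \theta H) \propto (s'^2+\bar{x}^2)^{-n/2}, \qquad (20)$$

$$P(q' \mid \theta H) \propto \sqrt{\frac{2}{\pi n}}\;\frac{1}{s'^{\,n}\,(1+\bar{x}^2/s'^2)}, \qquad (21)$$

$$K \approx \sqrt{\frac{\pi n}{2}}\left(1+\frac{\bar{x}^2}{s'^2}\right)^{-\frac12 n+1}. \qquad (22)$$

The error of the approximations is of the order of 1/n of the whole
expression. In terms of

$$t = \sqrt{n-1}\;\bar{x}/s', \qquad \nu = n-1, \qquad (23)$$

$$K \approx \sqrt{\frac{\pi\nu}{2}}\left(1+\frac{t^2}{\nu}\right)^{-\frac12\nu+\frac12}. \qquad (24)$$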

The corresponding formula given in the first edition of this book was 

The new value is larger for t small and smaller for t large. We may say 
that the present test is a little more sensitive. 

If K is very small, so that it is practically certain that λ is not zero,
the posterior probability of σ and λ is nearly proportional to

P(q' dλ dσ | θH).

Comparing (7) with 3.41 (2) we see that the posterior probability is
nearly the same as in the estimation problem, being obtained to this
accuracy by changing ν to ν−1.

The behaviour of K is seen most easily by considering the case when
ν is large enough for the t factor to be replaced by exp(−½t²). When
t = 2 this is 0.135; when t = 3 it is 0.011. In the former case K = 1
when ν is about 30; in the latter K = 1 when ν is about 5,000. The
variation of K with t is much more important than the variation with ν;
in fact, for given K, t increases like (log ν)^{1/2}, which is a very slow increase.
We may say that if t > 3, K will be less than 1, and the introduction of
the new parameter will be supported, for any number of observations
that ordinarily occurs. If t = 2, K will be greater than 1 if ν > 30, and
again for small values of ν; in the case of ν = 2 and t = 2 the formula (24)
makes K nearly 1, though the accuracy of the approximation is not to be
trusted when ν is so small. Without elaborate calculation we can then
say that values of t less than 2 will in most cases be regarded as
confirming the null hypothesis; values greater than 3 will usually be taken as
an indication that the true value is not zero.
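The dependence of K on t and ν described here can be checked with a few lines of code. This is a sketch only, assuming the asymptotic form (24) reads K ≈ √(πν/2)(1 + t²/ν)^{−(ν−1)/2}; the function names and the bisection helper are illustrative.

```python
import math


def K_from_t(t, nu):
    """Asymptotic K in terms of t and nu = n - 1 (assumed reading of (24))."""
    if nu < 1:
        raise ValueError("nu = n - 1 must be at least 1")
    return math.sqrt(math.pi * nu / 2.0) * (1.0 + t * t / nu) ** (-(nu - 1) / 2.0)


def critical_t(nu, lo=0.0, hi=50.0):
    """Value of t at which K = 1, by bisection.

    For nu >= 2, K falls steadily as t grows, so a single crossing exists
    between t = 0 (where K > 1) and a large t (where K < 1).
    """
    if nu < 2:
        raise ValueError("K does not depend on t when nu = 1")
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if K_from_t(mid, nu) > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for nu in (2, 10, 30, 100, 1000, 5000):
        print(nu, round(K_from_t(2.0, nu), 3), round(K_from_t(3.0, nu), 4),
              round(critical_t(nu), 3))
```

The printed critical values of t change only slowly with ν, which is the (log ν)^{1/2} behaviour mentioned in the text.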

The fact that when K is small the posterior probability of σ and λ is
almost the same as in the estimation problem is an indication that we
are working on the right lines. There would be no inconsistency in
taking f(v) oc where k is some positive constant, but we have 
already seen that if we did so K would never be less than some positive
function of n however closely the observations agreed among
themselves. Similarly the posterior probability of σ and λ, even if all the
observations agreed exactly, would be the same as if there was an
additional observation of positive weight at x = 0. In cases where the
null hypothesis is rejected we should never be led to the conclusion that
the standard error was near 0 however closely the observations might
agree. The chief advantage of the form that we have chosen is that in
any significance test it leads to the conclusion that if the null hypothesis
has a small posterior probability, the posterior probability of the
parameters is nearly the same as in the estimation problem. Some
difference remains, but it is only a trace.

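It is also possible to reduce 1/K exactly to a single integral. (18) is
exactly

$$P(q \mid \theta H) \propto \frac{2^{\frac12 n-1}\,(\tfrac12 n-1)!}{\{n(\bar{x}^2+s'^2)\}^{n/2}}. \qquad (26)$$

From (7) and (15), with

$$\lambda = \sigma v, \qquad \frac{n}{2\sigma^2}(\bar{x}^2+s'^2) = u; \qquad (27)$$

$$P(q' \mid \theta H) \propto \frac{1}{\pi}\int_{-\infty}^{\infty}\frac{dv}{1+v^2}\int_0^\infty \sigma^{-n-1}\exp\left[-\frac{n}{2\sigma^2}\{(\bar{x}-\sigma v)^2+s'^2\}\right]d\sigma$$

$$= \frac{2^{\frac12 n-1}}{\pi\{n(\bar{x}^2+s'^2)\}^{n/2}}\int_{-\infty}^{\infty}\int_0^\infty u^{\frac12 n-1}\exp\left\{-u+nv\bar{x}\left(\frac{2u}{n(\bar{x}^2+s'^2)}\right)^{1/2}-\tfrac12 nv^2\right\}\frac{du\,dv}{1+v^2}. \qquad (28)$$

Integrate term by term; odd powers of v contribute nothing to the
double integral; and we have

$$P(q' \mid \theta H) \propto \frac{2^{\frac12 n-1}\,(\tfrac12 n-1)!}{\pi\{n(\bar{x}^2+s'^2)\}^{n/2}}\int_{-\infty}^{\infty}\frac{e^{-\frac12 nv^2}}{1+v^2}\;{}_1F_1\!\left(\tfrac12 n,\tfrac12,\frac{nv^2\bar{x}^2}{2(\bar{x}^2+s'^2)}\right)dv, \qquad (29)$$

$$\frac{1}{K} = \frac{2}{\pi}\int_0^\infty {}_1F_1\!\left(\tfrac12 n,\tfrac12,\frac{nv^2\bar{x}^2}{2(\bar{x}^2+s'^2)}\right)\frac{e^{-\frac12 nv^2}}{1+v^2}\,dv, \qquad (30)$$

where ₁F₁(α, γ, x) denotes the confluent hypergeometric function

$$1+\frac{\alpha}{\gamma}x+\frac{\alpha(\alpha+1)}{2!\,\gamma(\gamma+1)}x^2+\dots \qquad (31)$$

By a known identity†

$${}_1F_1(\alpha,\gamma,x) = e^{x}\,{}_1F_1(\gamma-\alpha,\gamma,-x); \qquad (32)$$

hence an alternative form of 1/K is

$$\frac{1}{K} = \frac{2}{\pi}\int_0^\infty {}_1F_1\!\left(\tfrac12-\tfrac12 n,\tfrac12,-\frac{nv^2\bar{x}^2}{2(\bar{x}^2+s'^2)}\right)\exp\left\{-\frac{ns'^2v^2}{2(\bar{x}^2+s'^2)}\right\}\frac{dv}{1+v^2}. \qquad (33)$$

† H. and B. S. Jeffreys, Methods of Mathematical Physics, 1946, 676.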
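The single-integral form lends itself to direct numerical evaluation. The sketch below assumes that (33) reads 1/K = (2/π) ∫₀^∞ ₁F₁(½−½n, ½, −nv²x̄²/(2(x̄²+s'²))) exp{−ns'²v²/(2(x̄²+s'²))} dv/(1+v²). It treats odd n only, for which the hypergeometric series terminates in a polynomial with positive terms. The substitution v = tan φ, the midpoint rule and all names are choices made for this illustration, and the routine is meant for moderate n.

```python
import math


def terminating_1f1(m, z):
    """1F1(-m; 1/2; -z) for integer m >= 0: sum of C(m, r) z^r / (1/2)_r."""
    total = term = 1.0
    for r in range(1, m + 1):
        term *= (m - r + 1) / r * z / (r - 0.5)
        total += term
    return total


def inverse_K_exact(n, xbar, s2, steps=20000):
    """1/K from the single integral, for odd n, by the midpoint rule.

    s2 is s'^2, the mean square deviation about the mean. Putting
    v = tan(phi) turns dv / (1 + v^2) into d(phi) on (0, pi/2).
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("this sketch handles odd n only")
    total_sq = xbar * xbar + s2
    if total_sq <= 0.0:
        raise ValueError("xbar and s2 cannot both be zero")
    m = (n - 1) // 2
    a = n * xbar * xbar / (2.0 * total_sq)  # scale of the 1F1 argument
    b = n * s2 / (2.0 * total_sq)           # decay rate of the exponential
    if b == 0.0 and m > 0:
        return math.inf                     # exact agreement, n >= 3: K = 0
    h = (math.pi / 2.0) / steps
    acc = 0.0
    for i in range(steps):
        v2 = math.tan((i + 0.5) * h) ** 2
        if b * v2 > 700.0:                  # integrand underflows to zero
            continue
        acc += terminating_1f1(m, a * v2) * math.exp(-b * v2)
    return (2.0 / math.pi) * acc * h


if __name__ == "__main__":
    print(inverse_K_exact(1, 2.5, 0.0))         # one observation
    print(1.0 / inverse_K_exact(101, 0.0, 1.0)) # compare sqrt(pi * 101 / 2)
    print(inverse_K_exact(3, 1.0, 0.0))         # exact agreement of three
```

For n = 1 the routine returns 1/K = 1 whatever x̄ is, and for n ≥ 3 with s' = 0 it reports divergence, in line with the two requirements laid down at the start of the section.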


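5.21. Test of whether a true value is zero: σ taken as known.

Since (16) is taken to hold for all σ, we can use it when σ is already
known; then

$$P(q \mid \theta H) \propto \exp\left(-\frac{n\bar{x}^2}{2\sigma^2}\right), \qquad (34)$$

$$P(q' \mid \theta H) \propto \frac{1}{\pi}\int_{-\infty}^{\infty}\exp\left\{-\frac{n(\bar{x}-\lambda)^2}{2\sigma^2}\right\}\frac{d\lambda}{\sigma(1+\lambda^2/\sigma^2)}, \qquad (35)$$

$$K \approx \left(\tfrac12\pi n\right)^{1/2}\left(1+\frac{\bar{x}^2}{\sigma^2}\right)\exp\left(-\frac{n\bar{x}^2}{2\sigma^2}\right). \qquad (36)$$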


The method used in the first edition failed to cover this case, but there 
are many applications where the standard error is so well known from 
collateral evidence that it can be taken as known. 


5.3. Generalization by invariance theory. We have seen that for
the normal law a satisfactory form of the prior probability is

$$P(d\lambda \mid q'\sigma H) = \frac{d\lambda}{\pi\sigma(1+\lambda^2/\sigma^2)}. \qquad (1)$$

Now both I₂ and J of 3.9 (14), (15), when ζ = 0, are functions of λ/σ; in
fact

$$I_2 = 2\{1-\exp(-\lambda^2/8\sigma^2)\}, \qquad J = \lambda^2/\sigma^2, \qquad (2)$$

$$\frac{1}{\pi}\,\frac{d\lambda}{\sigma(1+\lambda^2/\sigma^2)} = \frac{1}{\pi}\,d\tan^{-1}\{-8\log(1-\tfrac12 I_2)\}^{1/2} = \frac{1}{\pi}\,d\tan^{-1}J^{1/2}, \qquad (3)$$

where the square roots are taken to have the same sign as λ/σ. The
relation to J is much simpler than the relation to I₂.

We could therefore make it a general rule in significance tests to
express the new parameter in terms of I₂ or J calculated for comparison
of the null hypothesis with the alternative hypothesis, and use prior
probabilities of either as given by (3). If the inverse tangents do not
range from −½π to ½π, as in cases where the new parameter can take
only one sign, correcting factors will be needed. We therefore have
possible general rules for significance tests. These rules, however,
disagree with those that we have used in problems of sampling, and
our first task must be to see whether they will give satisfactory solutions
in those cases.

For the comparison of two sets of chances

$$\begin{pmatrix}\alpha\beta & \alpha(1-\beta)\\ (1-\alpha)\beta & (1-\alpha)(1-\beta)\end{pmatrix}, \qquad \begin{pmatrix}\alpha\beta+\gamma & \alpha(1-\beta)-\gamma\\ (1-\alpha)\beta-\gamma & (1-\alpha)(1-\beta)+\gamma\end{pmatrix} \qquad (4)$$

we find

$$J_1 = \gamma\log\frac{(\alpha\beta+\gamma)\{(1-\alpha)(1-\beta)+\gamma\}}{\{\alpha(1-\beta)-\gamma\}\{(1-\alpha)\beta-\gamma\}}. \qquad (5)$$

This would, by the rule just suggested, be suitable to give a prior
probability distribution for γ in a contingency problem. Suppose, on
the other hand, that we take a sample of φ's and ~φ's, of given
numbers n₁, n₂, from the class. The chances of ψ and ~ψ, given φ and
~φ respectively, will be (γ being 0 on q)

$$\{\beta+\gamma/\alpha,\; 1-\beta-\gamma/\alpha\}, \qquad \{\beta-\gamma/(1-\alpha),\; 1-\beta+\gamma/(1-\alpha)\}. \qquad (6)$$

Comparing these two pairs of chances we find

$$J_2 = \frac{\gamma}{\alpha(1-\alpha)}\log\frac{(\alpha\beta+\gamma)\{(1-\alpha)(1-\beta)+\gamma\}}{\{\alpha(1-\beta)-\gamma\}\{(1-\alpha)\beta-\gamma\}} = \frac{J_1}{\alpha(1-\alpha)}. \qquad (7)$$

If we took samples of ψ's and ~ψ's and counted the φ's and ~φ's
we should get similarly

$$J_3 = \frac{J_1}{\beta(1-\beta)}. \qquad (8)$$

To satisfy the condition that the significance test, for given sampling
numbers, should be nearly independent of the conditions of sampling,
the prior probability of γ, given α and β, should be the same in all
cases. Hence we cannot simply use J universally. But we can define
a J that would be equal, for given γ, in the three cases, and with the
proper properties of symmetry, by taking either


or 


Jj, a(l — alJa. 


(9) 

(10) 


The first set are plainly unsatisfactory. For J tends to infinity at the
extreme possible values of γ; hence if the estimate of γ is a small
quantity c it will lead to

$$K \propto N^{1/2}\exp(-\tfrac12\chi^2),$$

where N is the sum of the sample numbers. This conflicts with the rule
of 5.03 that the outside factor should be of order (x+y)^{1/2}, where x+y
is the smallest of the row and column totals. On the other hand, the
second set are consistent with this rule.

A minor objection is that two pairs of chances expressed in the form
(6) do not suffice to determine α, β, and γ, and there is some
indeterminacy as to what we shall take for β in (10). But so long as γ is small
it will make little difference what value between β+γ/α and β−γ/(1−α)
we choose.

I₂ is much less satisfactory in this problem. There is no simple exact
relation between the values of I₂ in the three comparisons made. Also
I₂ takes finite values (not 2) for the extreme possible values of γ if
neither α nor β is 0 or 1. It appears therefore that I₂ cannot be made
to satisfy the conditions by any linear transformation. In view of the
greater complexity of the expression in I₂ in (3) than of that in J, it
appears unnecessary to pay further attention to I₂ at present.

An objection to J , even in the modified form, is that if the suggested 
value of a chance is 1 comparison with any other value gives J infinite. 
Consequently the rule based on J in (3) would concentrate the whole 
of the prior probability of the chance in the value 1 on the alternative
hypothesis, which thereby becomes identical with the null hypothesis. 
Of course a single exception to the rule would disprove the null hypothe- 
sis deductively in such a case, but nevertheless the situation is less 
satisfactory than in the analysis given in 5.1. It might even be said 
that the use of anything as complicated as J in so simple a problem as 
the testing of a suggested chance is enough to condemn it. 

It appears to be worth recording the asymptotic forms given by (10) 
in the problems of 5.1. We find without much difficulty 


K 

K 


~ {■7T{x+y)pp'yi‘‘exp{ — ^X^) for 6.1 (9), 


for 5.11(18), 5.12(10), 6.13(3). 


An evaluation of K has also been made for the problem of 5.11, using
the estimation prior probabilities given by the invariance rules. It was
again of the order of N^{1/2}. These attempts at using the invariance theory
in sampling problems, therefore, confirm the suggestion of 3.9 (p. 163)
that there is nothing to be gained by attempting to ensure general
invariance for transformation of chances; uniform distribution within
the permitted intervals is more satisfactory, as far as can be seen at
present. We shall, however, use the rule based on J in the more
complicated cases where there is no obvious suggestion from more
elementary ones.

5.31. General approximate forms. We see from 3.9 (3) that if
a new parameter α is small,

$$J = g_{\alpha\alpha}\,\alpha^2, \qquad (1)$$

and if α can take either sign, the range of possible values being such
that J can tend to infinity for variations of α in either direction,

$$P(d\alpha \mid q'H) = \frac{1}{\pi}\,\frac{dJ^{1/2}}{1+J} = \frac{1}{\pi}\,g_{\alpha\alpha}^{1/2}\,d\alpha \qquad (2)$$

for α small. If n observations yield an estimate α = a, where na³ can
be neglected,

$$\log L = \text{const.} - \tfrac12 n\,g_{\alpha\alpha}(\alpha-a)^2. \qquad (3)$$

Hence in 5.0 (4) we can put

$$f(a) = \frac{g_{\alpha\alpha}^{1/2}}{\pi}, \qquad s = \frac{1}{(n\,g_{\alpha\alpha})^{1/2}}, \qquad (4)$$

and then

$$K \approx \left(\frac{\pi n}{2}\right)^{1/2}\exp(-\tfrac12 n\,g_{\alpha\alpha}a^2). \qquad (5)$$

If α can take values only on one side of 0, (2) must be doubled for α
on that side, and if a also is on that side the value of K given by (5)
will be approximately halved. If a is on the other side of 0, the
approximate form fails; we shall see that K may then be of order n instead of
n^{1/2}. The approximate form will be adequate for practical purposes in the
majority of problems. Closer approximations are needed when n is
small; for instance, in problems concerned with the normal law the
need to estimate the standard error also from the same set of
observations may make an appreciable difference. But if n is more than 50 or
so (5) can be used as it stands without risk of serious mistakes.
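The general approximate form is simple enough to state as a small function. The sketch assumes that (5) reads K ≈ √(πn/2) exp(−½ n g a²), with g standing for the coefficient g_αα of (1); the `one_sided` switch applies the halving described above and is only meaningful when the estimate a lies on the permitted side of 0. All names are illustrative.

```python
import math


def approximate_K(n, g, a, one_sided=False):
    """General approximate K (assumed reading of 5.31 (5)).

    n         : number of observations
    g         : coefficient g_aa such that J = g * alpha**2 for small alpha
    a         : estimate of the new parameter
    one_sided : halve K when alpha is restricted to the side on which a lies
    """
    k = math.sqrt(math.pi * n / 2.0) * math.exp(-0.5 * n * g * a * a)
    return 0.5 * k if one_sided else k


if __name__ == "__main__":
    # Hypothetical figures: 100 observations, unit g.
    for a in (0.0, 0.1, 0.2, 0.3):
        print(a, approximate_K(100, 1.0, a), approximate_K(100, 1.0, a, one_sided=True))
```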

5.4. Other tests related to the normal law. 

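5.41. Test of whether two true values are equal, standard
errors supposed the same. This problem will arise when two sets of
observations made by the same method are used to detect a new
parameter by their difference. According to the rule that we are
adopting, any series of observations is suspected of being subject to
disturbance until there is reason to the contrary. When we are comparing
two series, therefore, we are really considering four hypotheses, not two
as in the test for agreement of a location parameter with zero; for
neither may be disturbed, or either or both may. We continue to
denote the hypothesis that both location parameters are λ by q, but q'
is broken up into three, which we shall denote by q₁, q₂, q₁₂. With an
obvious notation we therefore take

$$P(q\,d\sigma\,d\lambda \mid H) \propto \frac{d\sigma\,d\lambda}{\sigma}, \qquad (1)$$

$$P(q_1\,d\sigma\,d\lambda\,d\lambda_1 \mid H) \propto \frac{1}{\pi}\,\frac{d\sigma\,d\lambda\,d\lambda_1}{\sigma^2+(\lambda_1-\lambda)^2}, \qquad (2)$$

$$P(q_2\,d\sigma\,d\lambda\,d\lambda_2 \mid H) \propto \frac{1}{\pi}\,\frac{d\sigma\,d\lambda\,d\lambda_2}{\sigma^2+(\lambda_2-\lambda)^2}, \qquad (3)$$

$$P(q_{12}\,d\sigma\,d\lambda\,d\lambda_1\,d\lambda_2 \mid H) \propto \frac{1}{\pi^2}\,\frac{\sigma\,d\sigma\,d\lambda\,d\lambda_1\,d\lambda_2}{\{\sigma^2+(\lambda_1-\lambda)^2\}\{\sigma^2+(\lambda_2-\lambda)^2\}}. \qquad (4)$$

On q₁, λ₂ = λ; on q₂, λ₁ = λ. On q₁₂, since λ does not appear explicitly
in the likelihood, we can integrate with regard to it immediately:

$$P(q_{12}\,d\sigma\,d\lambda_1\,d\lambda_2 \mid H) \propto \frac{2}{\pi}\,\frac{d\sigma\,d\lambda_1\,d\lambda_2}{4\sigma^2+(\lambda_1-\lambda_2)^2}. \qquad (5)$$

Also

$$P(\theta \mid \sigma,\lambda_1,\lambda_2,H) \propto \sigma^{-n_1-n_2}\exp\left\{-\frac{n_1}{2\sigma^2}(\bar{x}_1-\lambda_1)^2-\frac{n_2}{2\sigma^2}(\bar{x}_2-\lambda_2)^2-\frac{n_1s_1'^2+n_2s_2'^2}{2\sigma^2}\right\}. \qquad (6)$$

Put

$$\nu = n_1+n_2-2; \qquad \nu s^2 = n_1s_1'^2+n_2s_2'^2. \qquad (7)$$

Then

$$P(q\,d\sigma\,d\lambda \mid \theta H) \propto \sigma^{-n_1-n_2}\exp\left\{-\frac{n_1}{2\sigma^2}(\bar{x}_1-\lambda)^2-\frac{n_2}{2\sigma^2}(\bar{x}_2-\lambda)^2-\frac{\nu s^2}{2\sigma^2}\right\}\frac{d\sigma\,d\lambda}{\sigma}, \qquad (8)$$

with corresponding equations for q₁, q₂, and q₁₂. It is easy to verify
that the posterior probabilities of all four hypotheses are equal if
n₁ = 1, n₂ = 0 or if n₁ = n₂ = 1, as we should expect. If n₁ and n₂
are large we find, approximately,

$$P(q \mid \theta H) : P(q_1 \mid \theta H) : P(q_2 \mid \theta H) : P(q_{12} \mid \theta H)$$

$$= \left(\frac{\pi}{2}\,\frac{n_1n_2}{n_1+n_2}\right)^{1/2}\left(1+\frac{(\bar{x}_1-\bar{x}_2)^2}{s^2}\right)\left\{1+\frac{n_1n_2}{n_1+n_2}\,\frac{(\bar{x}_1-\bar{x}_2)^2}{\nu s^2}\right\}^{-\frac12(\nu+1)} : 1 : 1 : \frac12\,\frac{s^2+(\bar{x}_1-\bar{x}_2)^2}{s^2+\frac14(\bar{x}_1-\bar{x}_2)^2}. \qquad (9)$$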


The equality of the second and third of these numbers is of course exact.
The last ranges from ½ to 2 as |x̄₁−x̄₂|/s increases from 0 to infinity.
Thus the test never expresses any decision between q₁ and q₂, as we
should expect, and never expresses one for q₁₂ against q₁ ∨ q₂. It expresses
a slight preference for q₁₂ against q₁ or q₂ separately if |x̄₁−x̄₂|/s > √2.

But there is so little to choose between the alternatives that we may
as well combine them. If |x̄₁−x̄₂|/s is small, as it usually will be, we
can write

$$\frac{P(q \mid \theta H)}{P(q_1 \vee q_2 \vee q_{12} \mid \theta H)} = \frac{2}{5}\left(\frac{\pi}{2}\,\frac{n_1n_2}{n_1+n_2}\right)^{1/2}\left\{1+\frac{n_1n_2}{n_1+n_2}\,\frac{(\bar{x}_1-\bar{x}_2)^2}{\nu s^2}\right\}^{-\frac12(\nu+1)}. \qquad (10)$$

Expressing the standard errors of x̄₁ and x̄₂ in the usual way,

$$s_{\bar{x}_1}^2 = s^2/n_1, \qquad s_{\bar{x}_2}^2 = s^2/n_2, \qquad (11)$$

$$t = \frac{\bar{x}_1-\bar{x}_2}{(s_{\bar{x}_1}^2+s_{\bar{x}_2}^2)^{1/2}}, \qquad (12)$$

we can write (10) as

$$\frac{P(q \mid \theta H)}{P(q_1 \vee q_2 \vee q_{12} \mid \theta H)} = \frac{2}{5}\left(\frac{\pi}{2}\,\frac{n_1n_2}{n_1+n_2}\right)^{1/2}\left(1+\frac{t^2}{\nu}\right)^{-\frac12(\nu+1)}. \qquad (13)$$

Decision between the three alternative hypotheses, in cases where this
ratio is less than 1, will require additional evidence such as a comparison
with a third series of observations.

Where there is strong reason to suppose that the first series gives an
estimate of a quantity contained in a theory and is free from systematic
error, the alternatives q₁ and q₁₂ do not arise, and the factor 2/5 in (13)
is unnecessary.
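As a rough sketch of how the ratio might be computed from two series, the function below assumes that (13) reads (2/5)·{π n₁n₂/(2(n₁+n₂))}^{1/2}·(1 + t²/ν)^{−(ν+1)/2}, with νs² the pooled sum of squares about the two means. The function name, the `combined` switch (which drops the factor 2/5 when only one alternative arises) and the data are all hypothetical.

```python
import math


def two_series_K(x1, x2, combined=True):
    """Ratio for equality of two true values, common standard error.

    combined=True  : alternatives q1, q2, q12 taken together (factor 2/5)
    combined=False : a single alternative only, factor 2/5 omitted
    """
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    nu = n1 + n2 - 2
    # nu * s^2 is the pooled sum of squared deviations about each mean.
    pooled_ss = sum((v - m1) ** 2 for v in x1) + sum((v - m2) ** 2 for v in x2)
    if nu < 1 or pooled_ss <= 0.0:
        raise ValueError("need nu >= 1 and some scatter within the series")
    s2 = pooled_ss / nu
    t2 = (m1 - m2) ** 2 / (s2 / n1 + s2 / n2)
    core = (math.sqrt(math.pi * n1 * n2 / (2.0 * (n1 + n2)))
            * (1.0 + t2 / nu) ** (-(nu + 1) / 2.0))
    return 0.4 * core if combined else core


if __name__ == "__main__":
    # Invented measurements, for illustration only.
    first = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9]
    second = [v + 0.5 for v in first]
    print(two_series_K(first, first))    # no discrepancy
    print(two_series_K(first, second))   # marked discrepancy
```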

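5.42. Test of whether two location parameters are the same,
standard errors not supposed equal. The method is substantially
as in the last section, and leads to the following equations:

$$P(q\,d\sigma_1 d\sigma_2 \mid \theta H) \propto \sqrt{\tfrac12\pi}\;\sigma_1^{-n_1}\sigma_2^{-n_2}\left(\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}\right)^{-1/2}\exp\left\{-\frac{n_1s_1'^2}{2\sigma_1^2}-\frac{n_2s_2'^2}{2\sigma_2^2}-\frac{(\bar{x}_1-\bar{x}_2)^2}{2(\sigma_1^2/n_1+\sigma_2^2/n_2)}\right\}d\sigma_1 d\sigma_2, \qquad (1)$$

$$P(q_1\,d\sigma_1 d\sigma_2 \mid \theta H) \propto \sigma_1^{-n_1}\sigma_2^{-n_2}\,\frac{\sigma_1}{\sigma_1^2+(\bar{x}_1-\bar{x}_2)^2}\exp\left\{-\frac{n_1s_1'^2}{2\sigma_1^2}-\frac{n_2s_2'^2}{2\sigma_2^2}\right\}d\sigma_1 d\sigma_2, \qquad (2)$$

P(q₂ dσ₁ dσ₂ | θH) follows by symmetry,

$$P(q_{12}\,d\sigma_1 d\sigma_2 \mid \theta H) \propto \sigma_1^{-n_1}\sigma_2^{-n_2}\,\frac{\sigma_1+\sigma_2}{(\sigma_1+\sigma_2)^2+(\bar{x}_1-\bar{x}_2)^2}\exp\left\{-\frac{n_1s_1'^2}{2\sigma_1^2}-\frac{n_2s_2'^2}{2\sigma_2^2}\right\}d\sigma_1 d\sigma_2. \qquad (3)$$

The form of the term in (x̄₁−x̄₂)² in (1) makes further approximations
awkward for general values of x̄₁−x̄₂, but we may appeal to the fact
that K is usually small when the maximum likelihood estimate of a
new parameter is more than about 3 times its apparent standard error.
If we have a good approximation when |x̄₁−x̄₂| is less than

$$3\sqrt{s_1^2/n_1+s_2^2/n_2},$$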

it will be useful up to values that make K small, and as it will be 
smaller still for larger values we cannot be led into saying that K is 
small when it is not. The precise evaluation of K when it is very small 
is not important; it makes no difference to our further procedure if we 
estimate K as 10~® when it is really 10“*, since we shall adopt the 
alternative hypothesis in either case. With these remarks we note that 
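if we can replace (x̄₁−x̄₂)²/(σ₁²/n₁+σ₂²/n₂) by (x̄₁−x̄₂)²(A₁/σ₁²+A₂/σ₂²), where A₁, A₂ are
chosen so that the functions and their first derivatives are equal when
σ₁ = s₁, σ₂ = s₂, the exponent in (1) will be sufficiently accurately
represented over the whole range where the integrand is not small in
comparison with its maximum. This condition is satisfied if we take

$$A_1 = \frac{s_1^4/n_1}{(s_1^2/n_1+s_2^2/n_2)^2}, \qquad A_2 = \frac{s_2^4/n_2}{(s_1^2/n_1+s_2^2/n_2)^2}. \qquad (4)$$

Replacing σ₁ and σ₂ by s₁ and s₂ in factors raised to small powers and
dropping common factors we find, with ν₁ = n₁−1, ν₂ = n₂−1,

$$P(q \mid \theta H) \propto \sqrt{\tfrac12\pi}\left(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}\right)^{-1/2}\left\{1+\frac{(s_1^2/n_1)(\bar{x}_1-\bar{x}_2)^2}{\nu_1(s_1^2/n_1+s_2^2/n_2)^2}\right\}^{-\frac12\nu_1}\left\{1+\frac{(s_2^2/n_2)(\bar{x}_1-\bar{x}_2)^2}{\nu_2(s_1^2/n_1+s_2^2/n_2)^2}\right\}^{-\frac12\nu_2}, \qquad (5)$$

$$P(q_1 \mid \theta H) \propto \frac{s_1}{s_1^2+(\bar{x}_1-\bar{x}_2)^2}, \qquad (6)$$

$$P(q_2 \mid \theta H) \propto \frac{s_2}{s_2^2+(\bar{x}_1-\bar{x}_2)^2}, \qquad (7)$$

$$P(q_{12} \mid \theta H) \propto \frac{s_1+s_2}{(s_1+s_2)^2+(\bar{x}_1-\bar{x}_2)^2}. \qquad (8)$$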


There may here be considerable grounds for decision between the
alternative hypotheses q₁, q₂, q₁₂. We recall that q₁ is the hypothesis
that the first series is disturbed and not the second, and our
approximations contemplate that |x̄₁−x̄₂| is small compared with s₁ and s₂.
Then if s₁ is much less than s₂, P(q₁ | θH) will be much more than
P(q₂ | θH), and P(q₁₂ | θH) will be slightly less than the latter. That is,
subject to the condition that either series of observations is initially
regarded as equally likely to be disturbed, the result of many
observations will be to indicate that the series with the larger standard
deviation is the less likely to be disturbed if the discrepancy is small
compared with the standard errors of one observation. If the approximations
remain valid (which has not been investigated) the contrary will hold if the
discrepancy between the means is greater than either standard error of
one observation.
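The behaviour described in this paragraph can be seen by evaluating the three relative weights directly. The sketch assumes the approximate forms P(q₁|θH) ∝ s₁/(s₁²+d²), P(q₂|θH) ∝ s₂/(s₂²+d²) and P(q₁₂|θH) ∝ (s₁+s₂)/((s₁+s₂)²+d²), with d = x̄₁−x̄₂; the function name and the trial values are illustrative only.

```python
import math


def alternative_weights(s1_sq, s2_sq, diff):
    """Relative weights of q1, q2 and q12 (assumed approximate forms).

    s1_sq, s2_sq : variances of a single observation in each series
    diff         : difference of the two means
    """
    s1, s2 = math.sqrt(s1_sq), math.sqrt(s2_sq)
    d2 = diff * diff
    return {
        "q1": s1 / (s1_sq + d2),
        "q2": s2 / (s2_sq + d2),
        "q12": (s1 + s2) / ((s1 + s2) ** 2 + d2),
    }


if __name__ == "__main__":
    # Small discrepancy: the series with the smaller scatter is the suspect.
    print(alternative_weights(1.0, 100.0, 0.1))
    # Discrepancy larger than either standard error: the order is reversed.
    print(alternative_weights(1.0, 100.0, 30.0))
```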

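5.43. Test of whether a standard error has a suggested value σ₀.
We take the true value to be 0. If the standard error is σ, and

$$\sigma = \sigma_0 e^{\zeta}, \qquad (1)$$

we have from 3.9 (15)

$$J = 2\sinh^2\zeta \qquad (2)$$

and

$$\frac{1}{\pi}\,d\tan^{-1}J^{1/2} = \frac{\sqrt2}{\pi}\,\frac{\cosh\zeta}{\cosh 2\zeta}\,d\zeta. \qquad (3)$$

Then according to 5.3 (3) we should take

$$P(q \mid H) = \tfrac12; \qquad P(q'\,d\zeta \mid H) = \frac{\cosh\zeta}{\pi\sqrt2\,\cosh 2\zeta}\,d\zeta. \qquad (4)$$

If there are n observations and the mean square deviation from 0 is s²,

$$P(\theta \mid qH) \propto \sigma_0^{-n}\exp\left(-\frac{ns^2}{2\sigma_0^2}\right), \qquad (5)$$

$$P(\theta \mid q'\sigma H) \propto \sigma^{-n}\exp\left(-\frac{ns^2}{2\sigma^2}\right), \qquad (6)$$

$$P(q \mid \theta H) \propto \sigma_0^{-n}\exp\left(-\frac{ns^2}{2\sigma_0^2}\right), \qquad (7)$$

$$P(q' \mid \theta H) \propto \frac{\sqrt2}{\pi}\int_{-\infty}^{\infty}\sigma^{-n}\exp\left(-\frac{ns^2}{2\sigma^2}\right)\frac{\cosh\zeta}{\cosh 2\zeta}\,d\zeta. \qquad (8)$$

The factors with n in the index have a maximum when σ = s. Put

$$s/\sigma_0 = e^{z}. \qquad (9)$$

For large n the expression (8) is approximately

$$\frac{\sqrt2}{\pi}\,\frac{\cosh z}{\cosh 2z}\,s^{-n}\exp(-\tfrac12 n)\sqrt{\frac{\pi}{n}} \qquad (10)$$

and

$$K \approx \sqrt{\frac{\pi n}{2}}\;\frac{\cosh 2z}{\cosh z}\;e^{nz}\exp\{\tfrac12 n(1-e^{2z})\}. \qquad (11)$$

This is greatest when z = 0 and is then √(½πn).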
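A numerical comparison of the large-n form with direct quadrature gives some feeling for its accuracy. The sketch assumes that (11) reads K ≈ √(πn/2)·(cosh 2z/cosh z)·e^{nz}·exp{½n(1−e^{2z})} with e^z = s/σ₀, and that 1/K is the ratio of (8) to (7) with the prior weight (√2/π) cosh ζ/cosh 2ζ. The midpoint rule, the integration range and the names are choices made for this illustration, which is intended for moderate n and |z|.

```python
import math


def K_asymptotic(n, s, sigma0):
    """Large-n form of K for a suggested standard error sigma0 (assumed (11))."""
    z = math.log(s / sigma0)
    return (math.sqrt(math.pi * n / 2.0) * math.cosh(2.0 * z) / math.cosh(z)
            * math.exp(n * z + 0.5 * n * (1.0 - math.exp(2.0 * z))))


def K_quadrature(n, s, sigma0, half_width=6.0, steps=20000):
    """K from a midpoint-rule evaluation of the integral over zeta.

    The integrand is the ratio of the two likelihoods,
    exp(-n*zeta - (n/2) * (s/sigma0)^2 * (exp(-2*zeta) - 1)),
    multiplied by cosh(zeta) / cosh(2*zeta).
    """
    b2 = (s / sigma0) ** 2
    h = 2.0 * half_width / steps
    acc = 0.0
    for i in range(steps):
        zeta = -half_width + (i + 0.5) * h
        expo = -n * zeta - 0.5 * n * b2 * (math.exp(-2.0 * zeta) - 1.0)
        if expo < -700.0:               # contribution underflows to zero
            continue
        acc += math.exp(expo) * math.cosh(zeta) / math.cosh(2.0 * zeta)
    return 1.0 / ((math.sqrt(2.0) / math.pi) * acc * h)


if __name__ == "__main__":
    for ratio in (1.0, 1.1, 1.2, 1.3):
        print(ratio, K_asymptotic(100, ratio, 1.0), K_quadrature(100, ratio, 1.0))
```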

If instead of using J we had used I₂ as in 5.3 (3), we should have had
instead of the second of (4)

and the first two factors in (11) would be replaced by 

^^(7r?i)co8h 2z. (13) 



An exact form of 1/K is

$$\frac{1}{K} = \frac{\sqrt2}{\pi}\int_0^\infty\frac{u^{n}(1+u^2)}{1+u^4}\exp\{-\tfrac12 nb^2(u^2-1)\}\,du, \qquad (14)$$

where σ = σ₀/u, s = σ₀b, b = e^z. It is seen that this tends to infinity
for n = 1 if b → 0 or b → ∞. (12) would give for n = 1

00 

i = lf (^ 5 ) 

0 

which tends to a finite limit as b → 0. (14) is more satisfactory because
it says that one deviation, if small enough, can give strong evidence
against q; (15) does not. Either gives 1/K large if b is large.

It has been supposed that all values of σ are admissible on q'; the
conditions contemplate a theory that predicts a definite standard error
σ₀, but we may be ready to accept a standard error either more or less
than the predicted value. But where there is a predicted standard error
the type of disturbance chiefly to be considered is one that will make
the actual one larger, and verification is desirable before the predicted
value is accepted. Hence we consider also the case where ζ is restricted to
be non-negative. The result is to change √2 in (8) to 2√2 and make the
lower limit 0. The approximations now fall into three types according as
ζ = z lies well within the range of integration, well outside it, or near 0.

If z > 0 and nz² is more than 4 or so, the large values of the integrand
on both sides of the maximum lie within the range of integration and
the integral is little altered; then the only important change is that K
as given by (11) must be halved.

If z = 0, only the values on one side of the maximum lie in the range
of integration and the integral is halved; this cancels the extra factor 2
and the result is unaltered.

If z < 0 and nz² is large, the integrand decreases rapidly from ζ = 0.
In fact

$$\sigma^{-n}\exp\left(-\frac{ns^2}{2\sigma^2}\right) \approx \sigma_0^{-n}\exp\left(-\frac{ns^2}{2\sigma_0^2}\right)\exp\{-n(1-e^{2z})\zeta\} \qquad (16)$$

and

$$K \approx \frac{\pi n}{2\sqrt2}\,(1-e^{2z}). \qquad (17)$$

The factor n in the last expression instead of the usual n^{1/2} needs
comment. In the usual conditions of a significance test the maximum
likelihood solution, here ζ = z, or σ = s, is a possible value on q'. But
here we are considering a case where the maximum likelihood solution
corresponds to a value of σ that is impossible on q', and is less probable
on any value of σ compatible with q' than on q. Naturally, therefore, if
such a value should occur it would imply unusually strong support for q.
Actually, however, such values will be rare, and if they occur they will
not as a rule be accepted as confirming q, as we shall see later (p. 281).

In the above treatment the true value has been taken as known. If
it is unknown (5) and (6) need modification. If we redefine s as the
standard deviation and put n−1 = ν, integration with regard to λ will
remove a factor 1/σ₀ from (7) and 1/σ from (8). The result will be that
n in (11) and (13) will be replaced by ν.

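5.44. Test of agreement of two estimated standard errors. We
shall consider the case where only one of them, σ₁, is possibly disturbed.

$$\sigma_1 = \sigma_2 e^{\zeta}, \qquad (1)$$

$$P(q\,d\sigma \mid H) \propto \frac{d\sigma}{\sigma}; \qquad P(q'\,d\sigma_1 d\sigma_2 \mid H) \propto \frac{d\sigma_2}{\sigma_2}\,\frac{\sqrt2}{\pi}\,\frac{\cosh\zeta}{\cosh 2\zeta}\,d\zeta, \qquad (2)$$

$$P(\theta \mid q\sigma H) \propto \sigma^{-n_1-n_2}\exp\left(-\frac{n_1s_1^2+n_2s_2^2}{2\sigma^2}\right),$$

$$P(\theta \mid q'\sigma_1\sigma_2 H) \propto \sigma_1^{-n_1}\sigma_2^{-n_2}\exp\left(-\frac{n_1s_1^2}{2\sigma_1^2}-\frac{n_2s_2^2}{2\sigma_2^2}\right),$$

$$P(q \mid \theta H) \propto \int_0^\infty\sigma^{-n_1-n_2}\exp\left(-\frac{n_1s_1^2+n_2s_2^2}{2\sigma^2}\right)\frac{d\sigma}{\sigma},$$

$$P(q' \mid \theta H) \propto \int_0^\infty\!\!\int_{-\infty}^{\infty}\sigma_1^{-n_1}\sigma_2^{-n_2}\exp\left(-\frac{n_1s_1^2}{2\sigma_1^2}-\frac{n_2s_2^2}{2\sigma_2^2}\right)\frac{\sqrt2}{\pi}\,\frac{\cosh\zeta}{\cosh 2\zeta}\,d\zeta\,\frac{d\sigma_2}{\sigma_2}.$$

Put

$$s_1 = s_2 e^{z}.$$

Then

$$\frac{1}{K} = \frac{\sqrt2}{\pi}\int_{-\infty}^{\infty}\frac{\cosh\zeta}{\cosh 2\zeta}\,e^{-n_1\zeta}\left(\frac{n_1e^{2z}+n_2}{n_1e^{2(z-\zeta)}+n_2}\right)^{\frac12(n_1+n_2)}d\zeta. \qquad (9)$$

The factors with large indices have a maximum when ζ = z, and we get
approximately

$$K \approx \left(\frac{\pi n_1n_2}{2(n_1+n_2)}\right)^{1/2}\frac{\cosh 2z}{\cosh z}\;e^{n_1z}\left(\frac{n_1+n_2}{n_1e^{2z}+n_2}\right)^{\frac12(n_1+n_2)}. \qquad (10)$$

K is unaltered if n₁ and n₂ are interchanged and the sign of z is reversed.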





If, in addition, z is fairly small, a further approximation gives 



If either or both of the standard errors is regarded as possibly dis- 
turbed, K can be adjusted as in 5.41 by multiplication by a factor 
between \ and Such conditions might arise when two methods of 
measurement have a great deal in common, but differ in other features, 
and it is uncertain which is the better. 

The more usual types of case where we wish to compare two standard
deviations for consistency are, first, when it is suspected that some
additional disturbance has increased the standard error in one set;
secondly, when methods have been altered in the hope of reducing the
standard error and we want to know whether they have been successful.
In the first case we expect ζ if not zero to be positive, in the latter
negative. We take the former case; then the second of (2) must be
multiplied by 2 and the range of ζ taken to be from 0 to ∞. If z is
positive and ^ 2 

Wi + Wj 

the net result is that K as given by (10) or (11) should be halved. If z
is negative K may be large of order n₁ or n₂.

5.45. Test of both the standard error and the location parameter.
If the null hypothesis is that two sets of data are derived from
the same normal law, we may need to test consistency of both the 
standard errors and the location parameters. There are cases where we 
need to arrange the work, when several new parameters are considered, 
so that the results will be independent of the order in which they are 
tested. This, I think, is not one of them. The question of consistency 
of the location parameters is hardly significant until we have some idea 
of the scale parameters, and if there is serious doubt about whether 
these are identical it seems nearly obvious that it should be resolved 
first. 

5.46. The following example, supplied to me by Professor C. Teodorescu
of Timisoara, illustrates the method of 5.42. There was a
suggestion that locomotive and wagon tires might be stronger near the
edges than in the centre, since they are subjected to more severe working
there in the process of manufacture. A number of test pieces were cut,
and a tensile test was made on each. The breaking tension, R, in
kilograms weight per mm.², and the percentage extension afterwards,
A, were recorded. In the first place the whole of the data are treated
as two independent series, of 150 observations each. For the edge
pieces the mean strength R̄₁ is found to be 89.59 kg./mm.², s₁², in the
corresponding unit, 7.274. For the centre pieces the mean is

R̄₂ = 88.17 kg./mm.², s₂² = 5.619.

s₁²/n₁ = 0.04849; s₂²/n₂ = 0.03746; R̄₁ − R̄₂ = 1.42;

$$P(q \mid \theta H) \propto \frac{\sqrt{\tfrac12\pi}}{\sqrt{0.08595}}\left(1+\frac{0.04849\times1.42^2}{149\times0.08595^2}\right)^{-149/2}\left(1+\frac{0.03746\times1.42^2}{149\times0.08595^2}\right)^{-149/2} = 4.27\,(1.15727)^{-149/2} = 7.8\times10^{-5},$$

$$P(q_1 \mid \theta H) \propto \frac{2.70}{7.27+2.02} = 0.29, \qquad P(q_2 \mid \theta H) \propto \frac{2.37}{5.62+2.02} = 0.31,$$

$$P(q_{12} \mid \theta H) \propto \frac{5.07}{25.74+2.02} = 0.18,$$

$$\frac{P(q \mid \theta H)}{P(q_1 \vee q_2 \vee q_{12} \mid \theta H)} = \frac{7.8\times10^{-5}}{0.78} = 1.0\times10^{-4}.$$

For the extensions the means were Ā₁ = 12.60 per cent., Ā₂ = 12.33
per cent., with s₁² = 1.505, s₂² = 1.425; we find similarly

$$P(q \mid \theta H) \propto 9.0\times0.1530 = 1.38,$$

$$P(q_1 \mid \theta H) \propto 0.78, \qquad P(q_2 \mid \theta H) \propto 0.81, \qquad P(q_{12} \mid \theta H) \propto 0.41,$$

$$\frac{P(q \mid \theta H)}{P(q_1 \vee q_2 \vee q_{12} \mid \theta H)} = \frac{1.38}{2.00} = 0.69.$$


Thus there is strong evidence for a systematic difference in the strengths.
The result for a difference in the extensions is indecisive.

Since, however, the question asked directly is ‘Has the extra working
at the edges had a systematic effect?’ it may be held that q₂ and q₁₂
do not arise and that we need only consider q₁. Then for the strengths
we find

$$\frac{P(q \mid \theta H)}{P(q_1 \mid \theta H)} = \frac{7.8\times10^{-5}}{0.29} = 2.7\times10^{-4}$$

and for the extensions

$$\frac{P(q \mid \theta H)}{P(q_1 \mid \theta H)} = \frac{1.38}{0.78} = 1.8.$$

This way of looking at the data, however, omits an important piece
of information, since the pairs of values for different specimens from
the same tire were available. There is also a strong possibility of
differences between tires; that is why testing was undertaken before
comparison of centres and edges. This was so well established that it
can be treated as a datum. But then differences between tires will have
contributed to the various values of s², without affecting the differences
of the means. Hence the above values of K will be too high considering
this additional information. (If this effect was in doubt it could be
tested by means of the test for the departure of a correlation coefficient
from zero.) A more accurate test can therefore be obtained by treating
the differences between values for the same tire as our data, and testing
whether they differ significantly from 0. For the differences in R we
find s'² = 3.790, for those in A, s'² = 1.610, and we can use the simple
formula 5.2 (22). Then for R



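$$K = \left(\frac{149\pi}{2}\right)^{1/2}\left(1 + \frac{1.42^2}{3.79}\right)^{-74} = 3 \times 10^{-13}$$

and for A

$$K = \left(\frac{149\pi}{2}\right)^{1/2}\left(1 + \frac{0.27^2}{1.61}\right)^{-74} = 0.58.$$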


The evidence is now overwhelming for a difference in R and slightly 
in favour of a difference in A. This indicates how treatment of a 
systematic variation as random may obscure other systematic varia- 
tions by inflation of the standard error; but if comparisons for the 
same tire had not been available the first test would have been the 
only one possible. We notice that for R the variation of the differences 
between pieces from the same tire is less than the variation of either 
the centre or the edge pieces separately. For A it is a little greater; 
but if the variations were independent we should have expected the 
mean square variation to be about 1.495 + 1.416 = 2.91 instead of the
observed 1.61.
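
The two values of K for the paired differences can be recomputed in a few lines. The sketch assumes that 5.2 (22) has the form K = (πν/2)^{1/2} (1 + x̄²/s'²)^{−(ν−1)/2} with ν = n − 1 = 149; that form is an assumption here, adopted because it reproduces the order of magnitude quoted for R. The function name `k_one_sample` is hypothetical.

```python
import math

def k_one_sample(mean_diff, s_sq, n):
    """Approximate K for 'the true mean difference is zero'.

    Assumed form of 5.2 (22):
        K = sqrt(pi * nu / 2) * (1 + mean^2 / s'^2) ** (-(nu - 1) / 2),
    with nu = n - 1.
    """
    nu = n - 1
    return math.sqrt(math.pi * nu / 2) * (1 + mean_diff ** 2 / s_sq) ** (-(nu - 1) / 2)

# Differences edge minus centre, 150 tires.
k_strength = k_one_sample(1.42, 3.790, 150)   # breaking tension R
k_extension = k_one_sample(0.27, 1.610, 150)  # percentage extension A

print(f"K for R: {k_strength:.1e}")
print(f"K for A: {k_extension:.2f}")
```

The contrast between the two results comes almost entirely from the ratio x̄²/s'², which is about 0.53 for R and about 0.045 for A.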

The explanation of the much less decisive result for A even with the 
more accurate treatment may be that while R will depend on the least 
strength of any part of the specimen, the actual process of fracture 
includes a great deal of continuous flow, and while the stronger material 
is under a greater stress in the test it may also be relatively less ductile, 
so that two systematic effects partly cancel. 

5.47. The discovery of argon. Rayleigh’s data† in this investigation
refer to the mass of nitrogen obtained from air or by chemical methods,
within a given container at standard temperature and pressure. All are
in grams.

† Proc. Roy. Soc. 53, 1893, 146; 55, 1894, 340-4.


From air.

    By hot copper    By hot iron    By ferrous hydrate
      2.31035          2.31017          2.31024
           26             0986               10
           24             1010               28
           12             1001
           27

By chemical methods.

    Iron and NO      Iron and N₂O     NH₄NO₂
      2.30143          2.29869          2.29849
      2.29890              940               89
      2.29816
      2.30182


The respective means and estimated standard errors, the last in units 
of the last decimal, and the standard deviations are as follows: 

From air.

    Method 1.    2.31026 ± 3.7     s = 8.2
           2.    2.31004 ± 6.7     s = 13.4
           3.    2.31021 ± 5.5     s = 9.5

By chemical methods.

    Method 1.    2.30008 ± 91      s = 182
           2.    2.29904 ± 36      s = 50
           3.    2.29869 ± 20      s = 28


The variation of s is striking. This is to be expected when several 
of the series are so short. It is plain, however, that the variability for 
chemical nitrogen is greater than for atmospheric nitrogen. The greatest 
discrepancy in the two sets is that between chemical methods 1 and 3, 
and can be tested by the test of 5.44; since a pair of means have been
estimated we have to replace n₁ by ν₁ = 3, n₂ by ν₂ = 1. At these
values the accuracy of the approximation 5.44 (10) is of course somewhat
doubtful, but we may as well see what it leads to. Here

$$e^{z} = 182/28 = 6.5,$$

and we find K = 1.9. As this is a selected value there seems to be no
immediate need to suppose the standard error of one determination to 
have varied within either the set from air or the chemical set. We 
therefore combine the data and find the following values. 

                            Mean                    s       ν     s²/n
    From air              2.31017 ± 0.000040      13.7     11     15.6
    By chemical methods   2.29947 ± 0.00048      137.9      7   2378.2

    Difference            0.01070
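
The per-method and combined summaries can be re-derived from the raw weighings. The sketch below uses the usual estimate of s with n − 1 in the divisor, which is an assumption about how the printed values were formed. The first hot-copper weighing is entered as 2.31035, the reading that reproduces the combined s = 13.7. The helper name `summary` is hypothetical.

```python
import math
from statistics import mean, stdev

air = {
    "hot copper": [2.31035, 2.31026, 2.31024, 2.31012, 2.31027],
    "hot iron": [2.31017, 2.30986, 2.31010, 2.31001],
    "ferrous hydrate": [2.31024, 2.31010, 2.31028],
}
chemical = {
    "iron and NO": [2.30143, 2.29890, 2.29816, 2.30182],
    "iron and N2O": [2.29869, 2.29940],
    "NH4NO2": [2.29849, 2.29889],
}

def summary(values):
    """Return (mean, s, standard error of the mean); s uses n - 1."""
    s = stdev(values)
    return mean(values), s, s / math.sqrt(len(values))

for label, series in (("From air", air), ("By chemical methods", chemical)):
    print(label)
    for method, values in series.items():
        m, s, se = summary(values)
        # s and the standard error are shown in units of the fifth decimal.
        print(f"  {method:16s} {m:.5f} ± {se * 1e5:.1f}   s = {s * 1e5:.1f}")

air_all = [v for values in air.values() for v in values]
chem_all = [v for values in chemical.values() for v in values]
air_mean, air_s, air_se = summary(air_all)
chem_mean, chem_s, chem_se = summary(chem_all)
difference = air_mean - chem_mean
print(f"combined air:      {air_mean:.5f}  s = {air_s * 1e5:.1f}")
print(f"combined chemical: {chem_mean:.5f}  s = {chem_s * 1e5:.1f}")
print(f"difference:        {difference:.5f}")
```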





Here 


First compare the values of s 

' \2 18 / 10-0 \7xl00+ll, 


7-8x10 


1-7 


The existence of a difference between the accuracies of the determina- 
tions for atmospheric and chemical nitrogen is therefore strongly con- 
firmed. Finally, we apply 5.42 to test the difference of the means; 
taking the unit as 1 in the fifth decimal we get 


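$$P(q \mid \theta H) \propto 2.1 \times 10^{-9}, \qquad P(q_1 \mid \theta H) \propto 0.12 \times 10^{-4},$$

$$P(q_2 \mid \theta H) \propto 1.0 \times 10^{-4}, \qquad P(q_{12} \mid \theta H) \propto 1.1 \times 10^{-4},$$

$$\frac{P(q \mid \theta H)}{P(q_1 \vee q_2 \vee q_{12} \mid \theta H)} = 0.92 \times 10^{-5}.$$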


The existence of a systematic difference between the densities is there- 
fore established. In this case the systematic difference is about eight 
times the larger standard error of one observation. 

A very rough discussion can be done by the methods of contingency.
The mean of all the data is 2.30978; all the 12 determinations for
atmospheric nitrogen are more than this, all 8 for chemical nitrogen
less. The use of a mean for comparison ensures that there will be one
more and one less than the mean; hence we can allow for one parameter
by deducting one from each total and testing the contingency table

     7    0
     0   11

for proportionality of the chances. This gives by 5.14 (10)

$$K = \frac{8!\,7!\,11!\,11!}{7!\,0!\,0!\,11!\,18!} = \frac{1}{3978},$$

which would be decisive enough for most purposes. Many problems of
measurement can be reduced to contingency ones in similar ways, and
the simple result is often enough. It has the advantage that it does not
assume the normal law of error. It does, however, sacrifice a great
deal of information if the law is true, corresponding to an increase of
the standard error above what would be got by a more accurate investigation,
and therefore usually (always in my experience so far) makes K
too large. Thus if the rough method gives K < 1 we can assert q',
but if it gives K > 1 we cannot say that the observations support q
without closer investigation.
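
The factorial expression is easy to check exactly. The following sketch only evaluates the quotient as printed; it does not re-derive 5.14 (10).

```python
from fractions import Fraction
from math import factorial as f

# K as printed: 8! 7! 11! 11! / (7! 0! 0! 11! 18!), evaluated exactly.
k_contingency = Fraction(
    f(8) * f(7) * f(11) * f(11),
    f(7) * f(0) * f(0) * f(11) * f(18),
)
print(k_contingency)
```

After cancellation the quotient is 8! 11!/18!, which is why only the marginal totals matter when two cells of the table are empty.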

According to the results the ratio of the densities is 1.00465 ± 0.00021,
effectively on 7 degrees of freedom since most of the uncertainty comes
from the chemical series. The 0.5, 0.1, and 0.05 points for t are at 0.71,
1.90, and 2.36. We can compare the result with what more detailed
determinations of the composition of air give. The percentages by
volume of N₂ and A are 78.1 and 0.93,† giving the density ratio

$$\frac{79 \times 28 + 0.93 \times 12}{79 \times 28} = 1.00505.$$

Hence

$$t = \frac{40}{21} = 1.9,$$

which is close to the 10 per cent. point.
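
The comparison can be retraced numerically from the combined means. In this sketch the figures 79, 28, 0.93 and 12 are taken exactly as they stand in the displayed ratio, and the standard error of the ratio is approximated by that of the chemical mean alone, in line with the remark that most of the uncertainty comes from the chemical series.

```python
# Combined means from the two series, in grams.
mean_air, mean_chemical = 2.31017, 2.29947
se_chemical = 0.00048

ratio_observed = mean_air / mean_chemical
# The chemical series dominates, so its relative error is used for the ratio.
se_ratio = se_chemical / mean_chemical

# Density ratio implied by the composition of air, as displayed in the text.
ratio_expected = (79 * 28 + 0.93 * 12) / (79 * 28)

t = (ratio_expected - ratio_observed) / 0.00021
print(f"observed ratio {ratio_observed:.5f} ± {se_ratio:.5f}")
print(f"expected ratio {ratio_expected:.5f}")
print(f"t = {t:.2f} on about 7 degrees of freedom")
```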

The outstanding problem is to understand the great difference 
between the standard deviations in Rayleigh’s results. 


5.5. Comparison of a correlation coefficient with a suggested 
value. We have seen that even in the estimation problem different 
ways of looking at the correlation problem suggest different ways of 
taking the prior probability distribution for the correlation coefficient. 
If we use the representation in terms of the model of 2.5 we should 
naturally take uniform distribution over the range permitted. If we use 
the rule in terms of J we have to consider whether the old parameters 
should be taken as σ, τ or not. These parameters have the property
that for any value of ρ they give the same probability distributions for x,
y separately. On the other hand, they are not orthogonal to ρ. As for
the testing of a simple chance the differences are not trivial, since the
outside factor would vary greatly according to the suggested value of ρ,
and in different ways. The difficulty is possibly connected with the 
question of the validity of the model and of the normal correlation law 
itself. In many cases where this is used it would be reasonable to regard x 
and y as connected in the first place by an exact linear relation, neither 
of them separately satisfying anything like a normal law, but subject to 
small disturbances which might or might not be normal. The evaluation 
of r in such cases is simply a test of approximate linearity of the relation 
between x and y and has nothing to do with normal correlation. 

† F. A. Paneth, Q. J. R. Met. Soc. 63, 1937, 433-8. Paneth states that the second figure
for A is uncertain, but the uncertainty suggested would hardly affect the comparison.

Tests relating to normal correlation based on J have been worked
out, but suffer from a peculiarity analogous to one noticed for sampling;
if the suggested value of ρ is 1 or −1, comparison of the null hypothesis
with any other value of ρ makes J infinite, and the alternative hypothesis
coalesces with the null hypothesis. Accordingly it seems safer
to take a uniform distribution for the prior probability of ρ. We shall
see that an additional restriction enters in the comparison of two
correlations, similar to one that arises for comparison of samples, and
that the outside factor is always of the order of the smaller of n₁, n₂.
In the first place we suppose the distribution of chance centred on
x = y = 0; the suggested value of ρ is ρ₀. Then

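$$P(q\, d\sigma\, d\tau \mid H) \propto d\sigma\, d\tau/\sigma\tau, \tag{1}$$

$$P(q'\, d\sigma\, d\tau\, d\rho \mid H) \propto d\sigma\, d\tau\, d\rho/2\sigma\tau, \tag{2}$$

the 2 entering because the possible range of ρ is from −1 to +1. The
likelihoods have the same form as in 3.8, and lead to

$$P(q\, d\sigma\, d\tau \mid \theta H) \propto \frac{1}{\sigma^{n+1}\tau^{n+1}(1-\rho_0^2)^{n/2}}
\exp\left[-\frac{n}{2(1-\rho_0^2)}\left(\frac{s^2}{\sigma^2}+\frac{t^2}{\tau^2}-\frac{2\rho_0 r s t}{\sigma\tau}\right)\right] d\sigma\, d\tau, \tag{3}$$

$$P(q'\, d\sigma\, d\tau\, d\rho \mid \theta H) \propto \frac{1}{2\sigma^{n+1}\tau^{n+1}(1-\rho^2)^{n/2}}
\exp\left[-\frac{n}{2(1-\rho^2)}\left(\frac{s^2}{\sigma^2}+\frac{t^2}{\tau^2}-\frac{2\rho r s t}{\sigma\tau}\right)\right] d\sigma\, d\tau\, d\rho. \tag{4}$$

With the substitutions 3.8 (5) we are led to

$$P(q \mid \theta H) \propto \int_{-\infty}^{\infty} (1-\rho_0^2)^{n/2}(\cosh\beta-\rho_0 r)^{-n}\, d\beta, \tag{5}$$

$$P(q' \mid \theta H) \propto \tfrac{1}{2}\int_{-\infty}^{\infty}\int_{-1}^{1} (1-\rho^2)^{n/2}(\cosh\beta-\rho r)^{-n}\, d\beta\, d\rho. \tag{6}$$

As we only want one term in the result it is convenient to use the
substitution

$$\cosh\beta-\rho r = (1-\rho r)e^{u} \tag{7}$$

instead of the previous one. This leads, on integration with respect to
u, to

$$P(q \mid \theta H) \propto \frac{(1-\rho_0^2)^{n/2}}{(1-\rho_0 r)^{n-1/2}}, \tag{8}$$

$$P(q' \mid \theta H) \propto \tfrac{1}{2}\int_{-1}^{1} \frac{(1-\rho^2)^{n/2}}{(1-\rho r)^{n-1/2}}\, d\rho. \tag{9}$$

Now putting

$$r = \tanh z, \qquad \rho = \tanh\zeta, \qquad \rho_0 = \tanh\zeta_0 \tag{10}$$

we get

$$P(q \mid \theta H) \propto \frac{\cosh^{n-1/2} z}{\cosh^{1/2}\zeta_0\, \cosh^{n-1/2}(\zeta_0-z)}, \tag{11}$$

$$P(q' \mid \theta H) \propto \tfrac{1}{2}\int_{-\infty}^{\infty} \frac{\cosh^{n-1/2} z\, d\zeta}{\cosh^{5/2}\zeta\, \cosh^{n-1/2}(\zeta-z)} \tag{12}$$

$$= \left(\frac{\pi}{2n-1}\right)^{1/2}\cosh^{n-3} z \tag{13}$$

for large n; ζ has been replaced by z in the factor cosh^{5/2} ζ. Hence

$$K = \left(\frac{2n-1}{\pi}\right)^{1/2}\frac{\cosh^{5/2} z}{\cosh^{1/2}\zeta_0\, \cosh^{n-1/2}(\zeta_0-z)}. \tag{14}$$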


If the distribution of chance is centred on a pair of values to be deter- 
mined, instead of on (0, 0), n — 1 must be substituted for n. 

As an example we may take the following seismological problem.
The epicentres and times of occurrence of a number of earthquakes
had been determined by Bullen and me by means of a standard table
of the times of travel of the P wave to different distances. Two other
phases, known as S and SKS, were studied, and their mean residuals
for the separate earthquakes were found.† These varied by much more
than would be expected from the standard errors found for them. Such
variation might arise if the focal depths of the earthquakes were not
all the same, since variation of focal depth would not affect the times
of all phases equally; or if any phase was multiple and there was a
tendency for observers in some cases to identify the earlier, and in
others the later, of two associated movements as the phase sought. In
either case the result might be a correlation between the mean S and
SKS residuals when P is taken as a standard. The individual values,
rounded to a second, were as follows.

      S    SKS          S    SKS
     −8    −10         +6     +8
     −6    −10         +4     +1
     −3     +1         −1      0
     +3     −6         +4      0
     −3     +1          0      0
     +3      0         −1     −1
     +2     −3         −7     −2
      0     +1         −8    −10
      0     −4         −3     −4
     +2      0

The means are −0.8 for S and −2.0 for SKS. Allowing for these
we find

$$\sum (x-\bar{x})^2 = 313, \qquad \sum (y-\bar{y})^2 = 376, \qquad \sum (x-\bar{x})(y-\bar{y}) = +229;$$

$$s = 4.06, \qquad t = 4.45, \qquad r = +0.667.$$

There are 19 determinations and a pair of means have been eliminated.

† Jeffreys, Bur. Centr. Intern. Séism. Assn., Trav. Sci. 14, 1936, 68.



Hence n in (14) must be replaced by 18. If there was no association
between the residuals we should have hypothesis q, with ρ = 0; and
we find

$$K = \left(\frac{35}{\pi}\right)^{1/2}(1-0.667^2)^{7.5} = 0.040.$$


Thus the observations provide 25 to 1 odds on association. Further work 
has to try to find data that will decide between possible explanations 
of this association (it has appeared that both the above suggestions 
contain part of the truth), but for many purposes the mere fact of 
association is enough to indicate possible lines of progress. The later 
work is an instance of the separation of a disjunction as described in 
1.61. Had K been found greater than 1 it would have indicated no 
association and both suggested explanations of the variations would 
have been ruled out. The tables used for comparison in obtaining the 
above data have been found to need substantial corrections, varying 
with distance and therefore from earthquake to earthquake, since the 
bulk of the stations observing S were at very different distances; 
allowance for these corrections would have made the correlation much 
closer. The corresponding correlations found in two later comparisons 
were and +0.97.†
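
The value K = 0.040 can be checked against the approximation (14), and (14) itself can be compared with a direct numerical evaluation of the ratio of (11) to (12). The sketch below does both for r = 0.667 with n replaced by 18. The function names, the integration limits and the use of Simpson's rule are choices made for this illustration, not part of the original treatment.

```python
import math

def k_correlation(r, n, rho0=0.0):
    """Approximate K of equation (14) for a suggested correlation rho0."""
    z, zeta0 = math.atanh(r), math.atanh(rho0)
    return (math.sqrt((2 * n - 1) / math.pi)
            * math.cosh(z) ** 2.5
            / (math.cosh(zeta0) ** 0.5 * math.cosh(zeta0 - z) ** (n - 0.5)))

def k_by_quadrature(r, n, rho0=0.0, half_width=12.0, steps=4000):
    """Ratio of (11) to (12), the integral in (12) done by Simpson's rule."""
    z, zeta0 = math.atanh(r), math.atanh(rho0)
    m = n - 0.5

    def integrand(zeta):
        return math.cosh(z) ** m / (math.cosh(zeta) ** 2.5 * math.cosh(zeta - z) ** m)

    h = 2 * half_width / steps  # steps must be even for Simpson's rule
    total = integrand(-half_width) + integrand(half_width)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(-half_width + i * h)
    p_alternative = 0.5 * total * h / 3
    p_null = math.cosh(z) ** m / (math.cosh(zeta0) ** 0.5 * math.cosh(zeta0 - z) ** m)
    return p_null / p_alternative

k_approx = k_correlation(0.667, 18)
k_numeric = k_by_quadrature(0.667, 18)
print(f"K from (14):          {k_approx:.4f}")
print(f"K by direct integral: {k_numeric:.4f}")
```

With ρ₀ = 0 the expression (14) reduces to (35/π)^{1/2}(1 − r²)^{7.5}, since cosh² z = 1/(1 − r²).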

† M.N.R.A.S. Geophys. Suppl. 4, 1938, 300.

5.51. Comparison of correlations. Correlations may be found
from two sets of data, and the question may then arise whether the
values are consistent with the true correlations being the same in both
populations. We take the case where two standard errors have to be
found separately for each set. On hypothesis q the true correlation is ρ,
to be found from the combined data; on q' it is ρ₁ in the first set and ρ₂
in the second. Let the numbers of observations in the two sets be n₁
and n₂, where n₁ > n₂. In accordance with the rule that the parameter
ρ must appear in the statement of q', and having regard to the standard
errors of the estimates of ρ₁ and ρ₂, we may define ρ on q' by

$$(n_1+n_2)\rho = n_1\rho_1+n_2\rho_2. \tag{1}$$

As ρ₂ ranges from −1 to +1, for given ρ, ρ₁ ranges from
{(n₁+n₂)ρ + n₂}/n₁ to {(n₁+n₂)ρ − n₂}/n₁.
Both are admissible values if

$$-\frac{n_1-n_2}{n_1+n_2} < \rho < \frac{n_1-n_2}{n_1+n_2}, \tag{2}$$

and the permitted range of ρ₂ is 2. But if

$$\rho > \frac{n_1-n_2}{n_1+n_2}, \tag{3}$$

ρ₁ will be +1 for

$$n_2\rho_2 = (n_1+n_2)\rho-n_1, \tag{4}$$

and the permitted range for ρ₂ is from this value to 1, a range of
(n₁+n₂)(1−|ρ|)/n₂. This will apply also if ρ is too small to satisfy
(2). Denote the permitted range of ρ₂ by c. Then the prior probabilities
are

$$P(q\, d\sigma_1\, d\tau_1\, d\sigma_2\, d\tau_2\, d\rho \mid H) \propto d\sigma_1\, d\tau_1\, d\sigma_2\, d\tau_2\, d\rho/\sigma_1\tau_1\sigma_2\tau_2, \tag{5}$$

$$P(q'\, d\sigma_1\, d\tau_1\, d\sigma_2\, d\tau_2\, d\rho\, d\rho_2 \mid H) \propto d\sigma_1\, d\tau_1\, d\sigma_2\, d\tau_2\, d\rho\, d\rho_2/\sigma_1\tau_1\sigma_2\tau_2 c. \tag{6}$$

The likelihoods are the products of those for the estimation problems,
and we can eliminate σ₁, τ₁, σ₂, τ₂ in terms of s₁, t₁, s₂, t₂ as before.

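Then

$$P(q\, d\rho \mid \theta H) \propto \frac{(1-\rho^2)^{\frac{1}{2}(n_1+n_2)}\, d\rho}{(1-\rho r_1)^{n_1-1/2}(1-\rho r_2)^{n_2-1/2}}, \tag{7}$$

$$P(q'\, d\rho\, d\rho_2 \mid \theta H) \propto \frac{(1-\rho_1^2)^{\frac{1}{2}n_1}(1-\rho_2^2)^{\frac{1}{2}n_2}}{(1-\rho_1 r_1)^{n_1-1/2}(1-\rho_2 r_2)^{n_2-1/2}}\,\frac{d\rho\, d\rho_2}{c} \tag{8}$$

$$\propto \frac{(1-\rho_1^2)^{\frac{1}{2}n_1}(1-\rho_2^2)^{\frac{1}{2}n_2}}{(1-\rho_1 r_1)^{n_1-1/2}(1-\rho_2 r_2)^{n_2-1/2}}\,\frac{n_1\, d\rho_1\, d\rho_2}{(n_1+n_2)c}, \tag{9}$$

and, using the ρ = tanh ζ transformation,

$$P(q \mid \theta H) \propto \int_{-\infty}^{\infty} \frac{\operatorname{sech}^{3}\zeta\, d\zeta}{\cosh^{n_1-1/2}(\zeta-z_1)\cosh^{n_2-1/2}(\zeta-z_2)}, \tag{10}$$

$$P(q' \mid \theta H) \propto \iint \frac{\operatorname{sech}^{5/2}\zeta_1\, \operatorname{sech}^{5/2}\zeta_2\, d\zeta_1\, d\zeta_2}{\cosh^{n_1-1/2}(\zeta_1-z_1)\cosh^{n_2-1/2}(\zeta_2-z_2)}\,\frac{n_1}{(n_1+n_2)c}. \tag{11}$$

Hence

$$P(q \mid \theta H) \propto \left(\frac{2\pi}{n_1+n_2-1}\right)^{1/2}
\operatorname{sech}^{3}\left\{\frac{(n_1-\frac{1}{2})z_1+(n_2-\frac{1}{2})z_2}{n_1+n_2-1}\right\}
\exp\left\{-\frac{(n_1-\frac{1}{2})(n_2-\frac{1}{2})(z_1-z_2)^2}{2(n_1+n_2-1)}\right\}, \tag{12}$$

$$P(q' \mid \theta H) \propto \frac{2\pi n_1}{(n_1-\frac{1}{2})^{1/2}(n_2-\frac{1}{2})^{1/2}(n_1+n_2)c}\,
\operatorname{sech}^{5/2} z_1\, \operatorname{sech}^{5/2} z_2, \tag{13}$$

$$K = \left\{\frac{(n_1-\frac{1}{2})(n_2-\frac{1}{2})}{2\pi(n_1+n_2-1)}\right\}^{1/2}\frac{(n_1+n_2)c}{n_1}\,
\operatorname{sech}^{3}\left\{\frac{(n_1-\frac{1}{2})z_1+(n_2-\frac{1}{2})z_2}{n_1+n_2-1}\right\}
\times \cosh^{5/2} z_1\, \cosh^{5/2} z_2\,
\exp\left\{-\frac{(n_1-\frac{1}{2})(n_2-\frac{1}{2})(z_1-z_2)^2}{2(n_1+n_2-1)}\right\}. \tag{14}$$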



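A little simplification is possible if we remember that a test will
be needed only if n₁ and n₂ are both rather large, and then the critical
value will be for a rather small value of z₁ − z₂. We can therefore
introduce a mean value z̄ given by

$$(n_1+n_2-1)\bar{z} = (n_1-\tfrac{1}{2})z_1+(n_2-\tfrac{1}{2})z_2; \tag{15}$$

and, nearly,

$$\rho = \tanh\bar{z}, \tag{16}$$

$$\frac{(n_1+n_2)c}{n_1} = \frac{2(n_1+n_2)}{n_1} \qquad \left(|\rho| < \frac{n_1-n_2}{n_1+n_2}\right), \tag{17}$$

and

$$\frac{(n_1+n_2)c}{n_1} = \frac{(n_1+n_2)^2(1-|\rho|)}{n_1 n_2} \qquad \left(|\rho| > \frac{n_1-n_2}{n_1+n_2}\right), \tag{18}$$

$$K = \left\{\frac{(n_1-\frac{1}{2})(n_2-\frac{1}{2})}{2\pi(n_1+n_2-1)}\right\}^{1/2}\frac{(n_1+n_2)c}{n_1}\,\cosh^{2}\bar{z}\,
\exp\left\{-\frac{(n_1-\frac{1}{2})(n_2-\frac{1}{2})(z_1-z_2)^2}{2(n_1+n_2-1)}\right\}. \tag{19}$$

A further and permissible approximation will be got by identifying
n₁ with n₁ − ½ and n₂ with n₂ − ½ in the outside factors; we can take these as

$$\left\{\frac{2(n_2-\frac{1}{2})(n_1+n_2-1)}{\pi(n_1-\frac{1}{2})}\right\}^{1/2} \qquad \left(|\rho| < \frac{n_1-n_2}{n_1+n_2}\right), \tag{20}$$

$$\left\{\frac{(n_1+n_2-1)^3}{2\pi(n_1-\frac{1}{2})(n_2-\frac{1}{2})}\right\}^{1/2}(1-|\rho|) \qquad \left(|\rho| > \frac{n_1-n_2}{n_1+n_2}\right). \tag{21}$$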
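
A minimal sketch of the simplified test, with K formed from the mean value z̄, the permitted range c and the exponential factor in (z₁ − z₂)², is given below. The function name `k_compare` and the sample inputs are hypothetical, the larger sample is taken first as in the text, and the reading of c and of the cosh² z̄ factor follows the derivation of this section as reconstructed here, so it should be treated as an assumption rather than a definitive formula.

```python
import math

def k_compare(r1, n1, r2, n2):
    """Approximate K for 'both sets share one true correlation'.

    Uses z = atanh(r), a weighted mean z_bar, the permitted range c of
    rho_2, and the factor cosh(z_bar)**2, with n1 >= n2 required.
    """
    if n1 < n2:
        raise ValueError("take the larger sample as the first set")
    z1, z2 = math.atanh(r1), math.atanh(r2)
    m1, m2 = n1 - 0.5, n2 - 0.5
    m = m1 + m2                      # equals n1 + n2 - 1
    z_bar = (m1 * z1 + m2 * z2) / m
    rho = math.tanh(z_bar)
    if abs(rho) < (n1 - n2) / (n1 + n2):
        c = 2.0
    else:
        c = (n1 + n2) * (1 - abs(rho)) / n2
    outside = math.sqrt(m1 * m2 / (2 * math.pi * m)) * (n1 + n2) * c / n1
    return outside * math.cosh(z_bar) ** 2 * math.exp(-m1 * m2 * (z1 - z2) ** 2 / (2 * m))

# Hypothetical inputs: equal sample correlations, then widely different ones.
k_same = k_compare(0.5, 100, 0.5, 50)
k_different = k_compare(0.8, 100, 0.2, 50)
print(f"equal r:     K = {k_same:.2f}")
print(f"different r: K = {k_different:.2e}")
```

When the two sample correlations agree the exponential factor is 1 and K is just the outside factor times cosh² z̄, so the data then favour q by a factor that is of the order of the square root of the smaller sample size.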


The tests given above for normal correlation can be adapted imme- 
diately to rank correlation. It would be necessary only to calculate 


$$1.0472\,(1+0.042\rho^2+0.008\rho^4+0.002\rho^6)$$

for the estimated ρ. Then in the outside factor of (14) we should divide
by this expression, and in the exponent we should divide by its square, 
in accordance with the form of the approximation 6.0(10). The 
correction is small enough for the effect of error in it to be regarded 
as negligible. 


5.6. The intraclass correlation coefficient. This arises when we
have a number of classes of k members each. If there is a component
variation common to all members of a class, with standard error τ, about
some general value, and superposed on it is a variation with standard
error σ', the ratio of the two can be estimated from the ratio of the
variation between the class means to the variation within the classes.
In the case k = 2, the expectation of the squared difference between
members of the same pair is 2σ'², that between members of different
pairs 2(σ'² + τ²) = 2σ². By analogy with the simple correlation coefficient
we may introduce a correlation ρ, and if x and y are members of the
same pair and E denotes expectations given the parameters,

$$E(x-y)^2 = E(x^2)+E(y^2)-2E(xy) = 2(1-\rho)\sigma^2$$

and also = 2σ'².

Hence
$$\rho = \tau^2/\sigma^2. \tag{1}$$

The last relation provides a definition of ρ even if there are many
members in each class. For if there were k in each group, σ and τ
retain their meaning in terms of expectations, and it would still be a
valid procedure to pick out two members at random from each group,
and for these the same argument will hold. Thus we can always define
ρ as meaning τ²/σ², irrespective of the number of groups and of the
number of observations per group. In terms of this definition ρ cannot
be negative.
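
The relations E(x − y)² = 2σ'² and ρ = τ²/σ² can be illustrated by a small simulation. The sketch assumes normal components, a general value of zero, and the particular values τ = 1, σ' = 2; all of these, like the function name `simulate_pairs`, are choices made only for the illustration.

```python
import random

def simulate_pairs(tau, sigma_prime, n_pairs, seed=0):
    """Pairs sharing a common component (s.d. tau) plus independent parts."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        common = rng.gauss(0.0, tau)
        pairs.append((common + rng.gauss(0.0, sigma_prime),
                      common + rng.gauss(0.0, sigma_prime)))
    return pairs

tau, sigma_prime = 1.0, 2.0
pairs = simulate_pairs(tau, sigma_prime, 100_000)

# E(x - y)^2 should be close to 2 * sigma'^2.
mean_sq_diff = sum((x - y) ** 2 for x, y in pairs) / len(pairs)
# With the general value known to be zero, the mean square estimates sigma^2.
mean_sq = sum(x * x + y * y for x, y in pairs) / (2 * len(pairs))

rho_estimate = 1 - mean_sq_diff / (2 * mean_sq)
rho_theory = tau ** 2 / (tau ** 2 + sigma_prime ** 2)
print(f"E(x-y)^2 ≈ {mean_sq_diff:.2f}, expected {2 * sigma_prime ** 2:.2f}")
print(f"rho ≈ {rho_estimate:.3f}, tau^2/sigma^2 = {rho_theory:.3f}")
```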

Brunt,† following Kapteyn, analyses the meaning of the correlation
coefficient in general by regarding m as the number of component
disturbances common to x and y, while n are independent. The correlation
ρ would then be equal to m/(m+n), and could be interpreted as
a ratio capable of being estimated by sampling, with its prior probability
uniformly distributed from 0 to 1. This appears to be a valid
analysis of the intraclass correlation. Thus in the correlation of height
between brothers it may be supposed that there is an inherited part
common to both, on which random variations due to segregation are
superposed. Negative values are excluded on such an analysis; to
include them we need the extended analysis given in 2.5. But there
seem to be many cases where this kind of analysis is valid, and there
is a close analogy between the ordinary and intraclass correlation
coefficients.

† Combination of Observations, 1931, p. 171.

The conditions contemplated in the hypotheses of intraclass correlation
arise in two types of case. One is illustrated by the comparison
of brothers just mentioned, where members of different families may
be expected to differ, on the whole, more widely than members of the
same family. In agricultural tests on productivity different specimens

are expected to differ more if they belong to different varieties than to 
the same variety. In these cases the comparison is a method of positive 
discovery, though in practice the existence of intraclass correlation is 
usually so well established already by examination of similar cases that 
the problem is practically one of estimation. In physics the problem is, 
perhaps, more often one of detecting unforeseen disturbances. Groups 
of observations made in the same way may yield independent estimates
of a parameter, with uncertainties determined from their internal 
consistency; but when the separate estimates are compared they may 
differ by more than would be expected if these uncertainties are 
genuine. Sometimes such discrepancies lead to new discoveries; more 
often they only serve as a warning that the apparent accuracies are not 
to be trusted. Doubts are often expressed about the legitimacy of 
combining large numbers of observations and asserting that the uncertainty
of the mean is n^{-1/2} times that of one observation. This statement
is conditional on the hypothesis that the errors follow a normal
law and are all independent. If they are not independent, further
examination is needed before we can say what the uncertainty of the 
mean is. The usual physical practice is to distinguish between ‘acci- 
dental’ errors, which are reduced according to the usual rule when many 
observations are combined, and ‘systematic’ errors, which appear in 
every observation and persist in the mean. Since some systematic errors 
are harmonic and other variations, which are not constant, but either 
are predictable or may become so, an extended definition is desirable. 
We shall say that a systematic error is a quantity associated with an 
observation, which, if its value was accurately known for one observation, 
would be calculable for all others. But even with this extended meaning 
of ‘systematic error’ there are many errors that are neither accidental 
nor systematic in the senses stated. Personal errors of observation are 
often among them. It is known that two observers of star transits, for 
instance, will usually differ in their estimates, one systematically record- 
ing the transit earlier or later than the other. Such a difference is called 
the personal equation. If it was constant it would come within the 
definition of systematic error, and is usually treated as such; it is 
determined by comparing with a standard observer or with an automatic 
recording machine, and afterwards subtracted from all readings made 
by the observer. Karl Pearson† carried out some elaborate experiments
to test whether errors of observation could be treated in this way, as
a combination of a random error with a constant systematic error for
each observer. The conditions of the experiments were designed so as
to imitate those that occur in actual astronomical observations. One
type consisted of the bisection of a line by eye, the accuracy being
afterwards checked by measurement. The other was essentially observation
of the time of an event, the recorded time being compared with
an automatic record of the event itself. The conditions resembled,
respectively, those in the determination of the declination and the time
of transit of a star with the transit circle. For each type of observation
there were three observers, who each made about 500 observations.
When the observations were taken in groups of 25 to 30 it was found
that the means fluctuated, not by the amounts that would correspond
to the means of 25 to 30 random errors with the general standard error
indicated by the whole series, but by as much as the means of 2 to 15
independent observations should. The analysis of the variation of the
observations into a constant systematic error and a random error is
therefore grossly insufficient. The non-random error was not constant
but reversed its sign at irregular intervals. It would resemble the kind
of curve that would be obtained if numbers −5 to +5, repetitions being
allowed, were assigned at random at equal intervals of an argument and
a polynomial found by interpolation between them. There is an element
of randomness, but the mere continuity of the function implies a correlation
between neighbouring interpolated values.

† Phil. Trans. A, 198, 1902, 235-99.
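
The size of the effect can be gauged with the model of 5.6. If every pair of observations in a group of k has correlation ρ = τ²/σ², the variance of the group mean is τ² + σ'²/k = σ²{1 + (k − 1)ρ}/k, so the group behaves like k/{1 + (k − 1)ρ} independent observations. The sketch below inverts this to find the ρ implied by figures such as those reported from Pearson's experiments. Treating the correlation as equal for all pairs within a group is an illustrative assumption, and the function names are hypothetical.

```python
def effective_number(k, rho):
    """Number of independent observations equivalent to k correlated ones."""
    return k / (1 + (k - 1) * rho)

def implied_rho(k, k_effective):
    """Correlation that makes k observations worth k_effective independent ones."""
    return (k / k_effective - 1) / (k - 1)

# Groups of 25 to 30 behaving like 2 to 15 independent observations.
for k, k_eff in ((25, 15), (30, 2)):
    rho = implied_rho(k, k_eff)
    print(f"k = {k}, equivalent to {k_eff}: rho = {rho:.3f}")
```

Even a correlation of a few hundredths is enough to make 25 observations worth only 15, which is why the rule of n^{-1/2} can fail so badly without any obvious sign in the individual readings.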

I shall speak of internal correlation as including intraclass correlation 
and also correlations similar to those just described. 

Internal correlation habitually produces such large departures from
the usual rule that the standard error of the mean is n^{-1/2} times that of
one observation that the rule should never be definitely adopted until
it has been checked. In a series of observations made by the same
observer, and arranged in order of time, internal correlation is the
normal thing, and at the present stage of knowledge hardly needs a
significance test any longer. It practically reduces to a problem of
estimation. The question of significance arises only when special
measures have been taken to eliminate the correlation and we want to
know whether they have been successful. Thus ‘Student’ writes:†
‘After considerable experience, I have not encountered any determination
which is not influenced by the date on which it is made; from this
it follows that a number of determinations of the same thing made on
the same day are likely to lie more closely together than if the repetitions
had been made on different days. It also follows that if the
probable error is calculated from a number of observations made close
together in point of time, much of the secular error will be left out and
for general use the probable error will be too small. Where, then, the
materials are sufficiently stable, it is well to run a number of determinations
on the same material through any series of routine determinations
which have to be made, spreading them over the whole period.’
He is speaking of physical and chemical determinations. In astronomy
an enormous reduction of uncertainty, by factors of 10 or 100, is
achieved by combining large numbers of observations. But astronomers
know by experience that they must be on the look-out for what they
call systematic errors, though many of them would come under what
I call internal correlation. They arrange the work so that star-positions
are compared with other stars on the same plate, so that any tendency
to read too high or to one side will cancel from the differences, even
though it might be reversed on the next plate measured; the scale of
the plate is determined separately for each plate by means of the comparison
stars; special care is taken to combine observations in such
a way that possible errors with daily or annual periods will not contribute
systematically to the quantity to be determined; as far as
possible observers are not aware what sign a systematic effect sought
would have on a particular plate; and so on. In seismology many of
the great advances of the past have been made by ‘special studies’, in
which one observer collects the whole of the records of an earthquake,
reads them himself, and publishes the summaries. There is here a
definite risk of some personal peculiarity of the observer appearing in
every observation and leading to a spurious appearance of accuracy.
Bullen and I dealt with this, in the first place, by using the readings
made at the stations themselves; thus any personal peculiarity would
affect only one observation for each phase for each earthquake, and
the resulting differences would contribute independently and could be
treated as random. In the design of agricultural experiments Fisher
and his followers are in the habit of eliminating some systematic ground
effects as accurately as possible; the rest would not necessarily be
random, but are deliberately made to contribute at random to the
estimates of the effects actually sought, by randomizing the design as
far as is possible consistently with the normal equations for the main
effects being orthogonal.

† Quoted by E. S. Pearson, Biometrika, 30, 1939, 228.

As a specimen of the kind of results obtainable with such precautions
we may take the comparisons of the times of the P wave in European
and North American earthquakes, for distances from 22.5° to 67.5°;
mean residuals are given against a trial table. Unit weight means a
standard error of 1 sec.


              Europe            N. America         Difference
     Δ      Mean   Weight     Mean   Weight               Weight     χ²
    22.5    −0.2    4.7       +1.0    0.6       +0.8       0.5       0.3
    23.5    −0.8    6.3       −0.1    0.6       +0.3       0.5       0.0
    24.5    −1.1    3.1       +1.0    0.5       +1.7       0.4       1.2
    25.5    −0.7    3.1       −0.2    0.9       +0.1       0.7       0.0
    26.5    +0.3    2.7       +0.1    1.0       −0.6       0.7       0.3
    27.5    −1.0    0.8       +0.3    1.2       +0.9       0.5       0.4
    29.0    −0.6    4.5       +0.3    2.0       +0.5       1.4       0.4
    31.5    −0.2    5.3       +0.7    2.6       +0.5       1.7       0.4
    34.5    −1.8    3.1       −0.6    2.8       +0.8       1.5       1.0
    37.5    −0.8    1.8       +0.8    2.1       +1.2       1.0       1.4
    40.5    +0.9    1.1       −0.5    1.3       −1.8       0.6       2.0
    43.5    −0.7    1.9       −1.4    0.8       −1.1       0.6       0.7
    46.5    −1.2    3.0       −1.5    1.0       −0.7       0.8       0.4
    49.5    −1.8    1.6       −1.4    0.8        0.0       0.5       0.0
    52.5    −1.0    2.6       −2.8    1.0       −2.2       0.7       3.4
    55.5    −0.7    1.9       −2.5    1.1       −2.2       0.7       3.4
    58.5    −1.0    1.2       −1.4    0.3       −0.8       0.3       0.2
    62.5    −1.2    1.4       −0.9    2.5       −0.1       0.9       0.1
    67.5    −1.3    1.2       −0.8    3.3       +0.1       0.9       0.1

                                                          Total     15.7


A constant systematic difference is to be expected, corresponding to
a slight difference in the way of estimating the origin times, arising
from the fact that the distributions of weight outside this range are
very different. The weighted mean of the difference is +0.4 s ± 0.3 s.
This is added to the European mean and the result subtracted from
the North American one. The results are given as ‘difference’, with the
corresponding weights. Then

$$\chi^2 = \sum (\text{weight})(\text{difference})^2 = 15.7$$

on 19 entries, from which one parameter has been determined, so that
the expectation of χ² is 18 on the hypothesis of randomness.

The distribution of signs at first sight suggests a systematic variation,
but we notice that up to 31.5° the whole weight of the 8 differences
is 6.4, and the weighted mean +0.45 ± 0.40, which is not impressive.
The last five give −0.91 ± 0.58. The magnitude of the differences is,
in fact, unusually small in the early part of the table, as we see from
the fact that the largest contribution to χ² is 1.2. There is no contribution
larger than 3.4, but on 19 entries we should have been prepared
to find one greater than 4.0 on the hypothesis of randomness.
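
The summary figures quoted for the table can be recomputed from its last columns. The sketch below takes the nineteen differences and weights as they are read from the table (a few digits are uncertain in the source) and forms χ², together with the weight, weighted mean and standard error for the first eight entries.

```python
# Differences (N. America minus adjusted Europe, in seconds) and their weights,
# in order of increasing distance from 22.5° to 67.5°.
differences = [+0.8, +0.3, +1.7, +0.1, -0.6, +0.9, +0.5, +0.5, +0.8, +1.2,
               -1.8, -1.1, -0.7, 0.0, -2.2, -2.2, -0.8, -0.1, +0.1]
weights = [0.5, 0.5, 0.4, 0.7, 0.7, 0.5, 1.4, 1.7, 1.5, 1.0,
           0.6, 0.6, 0.8, 0.5, 0.7, 0.7, 0.3, 0.9, 0.9]

# chi-squared is the weighted sum of squared differences.
chi_sq = sum(w * d * d for w, d in zip(weights, differences))

# First eight entries, i.e. distances up to 31.5°.
early_weight = sum(weights[:8])
early_mean = sum(w * d for w, d in zip(weights[:8], differences[:8])) / early_weight
early_se = early_weight ** -0.5   # unit weight means a standard error of 1 sec

print(f"chi-squared from rounded entries: {chi_sq:.1f}")
print(f"first eight: weight {early_weight:.1f}, mean {early_mean:+.2f} ± {early_se:.2f}")
```

The rounded entries give a total a little under the printed 15.7, as would be expected if the printed contributions were formed before the differences and weights were rounded to one decimal.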

5.61. Systematic errors: further discussion. For simplicity we
may take the very common case where the systematic error is an additive
constant. Now what can such a systematic error mean in terms
of our theory ? The true value, for our purposes, has been identified 
with the location parameter of the law of error, and the best estimate 
of this is definitely the mean. If, subject to it, the errors are independent, 
its uncertainty is correctly given by the usual formula, and we have 
seen how to correct it if they are not. Systematic error has a meaning 
only if we understand by the true value something different from the location
parameter. It is therefore an additional parameter, and requires a
significance test for its assertion. There is no epistemological difference 
between the Smith effect and Smith’s systematic error; the difference is 
that Smith is pleased to find the former, while he may be annoyed at 
the discovery of the latter. Now with a proper understanding of induc- 
tion there is no need for annoyance. It is fully recognized that laws 
are not final statements and that inductive inferences are not certain. 
The systematic error may be a source of considerable interest to his 
friend Smythe, an experimental psychologist. The important thing is 
to present the results so that they will be of the maximum use. This
is done by asserting no more adjustable parameters than are supported 
by the data, and the best thing for Smith to do is to give his location 
parameter with its uncertainty as found from his observations. The 
number of observations should be given explicitly. It is not sufficient 
merely to give the standard error, because we can never guarantee 
absolutely that the results will never be used in a significance test, and 
the outside factor depends on the number of observations. Two estimates
may both be +1.50 ± 0.50, but if one is based on 10 observations
with a standard error of 1.5 and the other on 90,001 with a standard
error of 150, they will give respectively K = 0.34 and K = 4.3 in a
test of whether the parameter is zero. Now this difference does not
correspond to statistical practice, but it does correspond to a feeling 
that physicists express in some such terms as ‘it is merely a statistical 
result and has no correspondence with physical reality’. The former
result would rest on about 8 observations with positive signs, and 2
with negative, an obvious preponderance, which would give K = 0.49
when tested against an even chance. The latter would rest on nearly 
equal numbers of observations with positive and negative signs. I think 
that the physicist’s feeling in this is entitled to respect, and that the 
difference in the values of K gives it a quantitative interpretation. 
The mean of a large number of rough observations may have the same 
value and the same standard error as that of a smaller number of
accurate observations, and provided that the independence of the errors
is adequately checked it is equally useful in an estimation problem; 
but it provides much less ground for rejecting a suggestion that the 
new parameter under discussion is zero when there is such a suggestion. 
Ultimately the reason is that the estimate is a selection from a wider 
range of possible values consistent with the whole variation of the 
observations from 0, and the difference in the values of K represents 
the allowance for this selection. 
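
The figures quoted for the two hypothetical series can be retraced with a short calculation. The sketch below assumes normally distributed errors and takes the standard error of a mean as the standard error of one observation divided by √n; the function names are ours. It does not attempt to reproduce the values of K, which depend on formulae given elsewhere in the chapter.

```python
import math

def standard_error_of_mean(sd, n):
    """Standard error of the mean of n independent observations."""
    return sd / math.sqrt(n)

def prob_positive(mean, sd):
    """Chance that one normally distributed observation is positive."""
    return 0.5 * math.erfc(-mean / (sd * math.sqrt(2.0)))

# (standard error of one observation, number of observations), mean +1.50 in both
for sd, n in [(1.5, 10), (150.0, 90001)]:
    se = standard_error_of_mean(sd, n)
    p = prob_positive(1.5, sd)
    print(n, round(se, 2), round(p, 3), round(n * p))
```

With 10 observations the mean has a standard error of about 0.47 (roughly the 0.50 quoted) and about 8 of the 10 signs are expected to be positive; with 90,001 observations the standard error is 0.50 but the proportion of positive signs is barely above one half.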

Now systematic differences between experiments with different 
methods, and even between different experimenters apparently using 
the same method, do exist. It is perfectly possible that what Smith 
does measure is something different from what he sets out to measure, 
and the difference is his systematic error. The quantity to be estimated 
may indeed be different in kind from the one actually measured. A 
meteorologist wants to know the atmospheric pressure, but what he 
observes is the height of a column of mercury. The conversion requires 
the use of a hydrostatic law, which is not questioned, but it involves 
the local value of gravity and the temperature, which enters through the 
density of the mercury. Allowing for the differences between these 
and some standard values is the removal of a calculable, and therefore 
a systematic, error. An astronomer wants the direction of a star, as 
seen from the centre of the earth ; but the observed direction is affected 
by refraction, and the latter is calculated and allowed for. The only 
increase of the uncertainty involved in applying such a correction 
represents the uncertainty of the correction itself, which is often 
negligible and can in any case be found. 

The problem that remains is: how should we deal with possible
systematic errors that are not yet established and whose values are
unknown? A method often adopted is to state possible limits to the
systematic error and combine this with the apparent uncertainty. If
the estimate is a ± s, and a systematic error may be between ±m
(usually greater than s), the observer may reckon the latter as
corresponding to a standard error of m/√3 and quote his uncertainty as
±√(s² + m²/3); or with a still more drastic treatment he may give it
as ±(s + m). Either treatment seems to be definitely undesirable. If
the existence of the error is not yet established it remains possible that
it is absent, and then the original estimate is right. If it exists, the 
evidence for its existence will involve an estimate of its actual amount, 
and then it should be allowed for; and the uncertainty of the corrected 
estimate will be the resultant of s and the determined uncertainty of
the systematic correction. In either case s has a useful function to
serve, and should be stated separately and not confused with m. The 
possible usefulness of m, where the existence of the error is not estab- 
lished and its actual amount therefore unknown, is that it suggests 
a possible range of values for a new parameter, which may be useful 
in comparison with other series of observations when material becomes 
available to test the presence of a systematic difference. But inspection 
of our general approximate formula shows that the statement of m will 
go into the outside factor, not into the standard error. If the standard 
error is inflated by m the result will be to increase the uncertainty 
unjustifiably if the suggested difference is not revealed by the accurate 
test; and to fail to reveal a difference at all when the test should show 
it and lead to an estimate of its amount. In either case the inclusion 
of m in the uncertainty leads to the sacrifice of information contained
in the observations that would be necessary to further progress (cf. 5.63). 
A separate statement of the possible range of the systematic error may 
be useful if there is any way of arriving at one, but it must be a separate 
statement and not used to increase the uncertainty provided by the 
consistency of the observations themselves, which has a value for the 
future in any case. In induction there is no harm in being occasionally 
wrong; it is inevitable that we shall be. But there is harm in stating 
results in such a form that they do not represent the evidence available 
at the time when they are stated, or make it impossible for future 
workers to make the best use of that evidence. 
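
To see how much the two deprecated treatments inflate a stated uncertainty, the following sketch may help. It reads them as combining s in quadrature with m/√3 (the standard error of a quantity spread uniformly between ±m) and as adding s and m outright; the numerical values of s and m are hypothetical and the function names are ours.

```python
import math

def inflated_quadrature(s, m):
    """s combined with m/sqrt(3), the standard error of a uniform spread on (-m, m)."""
    return math.sqrt(s ** 2 + m ** 2 / 3.0)

def inflated_sum(s, m):
    """The more drastic treatment: s and m simply added."""
    return s + m

s, m = 1.0, 2.0   # hypothetical values, not taken from the text
print(s, inflated_quadrature(s, m), inflated_sum(s, m))
```

In either form the quoted figure no longer lets a later worker recover s, which is the quantity the significance test needs.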

5.62. Estimation of intraclass correlation. In most treatments 
of this problem, including the one in the first edition of this book, the 
classes compared have been supposed equal in number. In such cases 
K can be reduced to a single integral. This condition is satisfied in 
balanced designs, such as are often used in biological experiments. In 
other applications it is rarely satisfied. However carefully an astronomer 
designs his observing programme it will generally be interrupted by 
cloud. Even in the comparison of brothers there is no theoretical reason 
for taking the same number from every family; the reason is only to 
make the analysis fairly easy. But it is usual for the scatter within the 
groups to give an estimate of the random error sufficiently accurate to 
be taken as a definite determination of σ. We suppose then that there
is a general location parameter λ; that there are m groups of observations,
the number in the rth group being k_r, and that there is a location
parameter λ_r associated with the rth group whose probability distribution
about λ is normal with standard error τ; and that within each group
the observed values are random with standard error σ about λ_r. The
uncertainty of σ is taken as negligible. We suppose the separate values
λ_r − λ, given τ, to be independent. This is the fundamental distinction
between intraclass correlation and systematic variation. The data are
the group means x_r. According to the hypotheses

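$$ x_r = \lambda \pm \sqrt{\tau^2 + \sigma^2/k_r} \tag{1} $$

and the likelihood is

$$ L = (2\pi)^{-m/2} \prod_r (\tau^2 + \sigma^2/k_r)^{-1/2} \exp\Bigl[-\tfrac12 \sum_r \frac{(x_r-\lambda)^2}{\tau^2+\sigma^2/k_r}\Bigr] \prod_r dx_r. \tag{2} $$

Then we have to estimate λ and τ. We have

$$ \frac{\partial}{\partial\lambda}\log L = \sum_r \frac{k_r(x_r-\lambda)}{\sigma^2+k_r\tau^2}, \tag{3} $$

$$ \frac{\partial}{\partial\tau^2}\log L = -\tfrac12\sum_r \frac{k_r}{\sigma^2+k_r\tau^2} + \tfrac12\sum_r \frac{k_r^2(x_r-\lambda)^2}{(\sigma^2+k_r\tau^2)^2}. \tag{4} $$

Putting these zero we have the maximum likelihood equations for λ
and τ². To get the uncertainties we need also the second derivatives

$$ \frac{\partial^2}{\partial\lambda^2}\log L = -\sum_r \frac{k_r}{\sigma^2+k_r\tau^2}, \tag{5} $$

$$ \frac{\partial^2}{(\partial\tau^2)^2}\log L = \tfrac12\sum_r \frac{k_r^2}{(\sigma^2+k_r\tau^2)^2} - \sum_r \frac{k_r^3(x_r-\lambda)^2}{(\sigma^2+k_r\tau^2)^3}. \tag{6} $$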
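
As a check on the derivatives (3) to (6), the sketch below evaluates the logarithm of the likelihood for group means with variance τ² + σ²/k_r about λ and compares the analytic derivatives with central differences. The data, the value of σ², and all function names are hypothetical and serve only as an illustration.

```python
import math

def log_likelihood(lam, tau2, x, k, sigma2):
    """log L for group means x_r with variance tau^2 + sigma^2/k_r about lam."""
    total = 0.0
    for xr, kr in zip(x, k):
        v = tau2 + sigma2 / kr
        total -= 0.5 * math.log(2.0 * math.pi * v) + 0.5 * (xr - lam) ** 2 / v
    return total

def d_lambda(lam, tau2, x, k, sigma2):       # equation (3)
    return sum(kr * (xr - lam) / (sigma2 + kr * tau2) for xr, kr in zip(x, k))

def d_tau2(lam, tau2, x, k, sigma2):         # equation (4)
    return sum(-0.5 * kr / (sigma2 + kr * tau2)
               + 0.5 * kr ** 2 * (xr - lam) ** 2 / (sigma2 + kr * tau2) ** 2
               for xr, kr in zip(x, k))

def d2_lambda(lam, tau2, x, k, sigma2):      # equation (5)
    return -sum(kr / (sigma2 + kr * tau2) for kr in k)

def d2_tau2(lam, tau2, x, k, sigma2):        # equation (6)
    return sum(0.5 * kr ** 2 / (sigma2 + kr * tau2) ** 2
               - kr ** 3 * (xr - lam) ** 2 / (sigma2 + kr * tau2) ** 3
               for xr, kr in zip(x, k))

def central_difference(f, t, h=1e-5):
    """Numerical derivative of f at t."""
    return (f(t + h) - f(t - h)) / (2.0 * h)

# Hypothetical group means, group sizes and sigma^2.
x = [0.3, -1.2, 2.1, 0.8]
k = [5, 12, 3, 8]
sigma2, lam, tau2 = 4.0, 0.4, 1.5

print(d_lambda(lam, tau2, x, k, sigma2),
      central_difference(lambda t: log_likelihood(t, tau2, x, k, sigma2), lam))
print(d_tau2(lam, tau2, x, k, sigma2),
      central_difference(lambda t: log_likelihood(lam, t, x, k, sigma2), tau2))
```

The two numbers on each printed line should agree closely; the same comparison applied to the first derivatives checks (5) and (6).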


The posterior probability distribution of λ will not reduce to a simple
t rule. If τ was 0 it would be normal with standard error (Σ k_r)^{-1/2} σ.
If σ was 0 it would follow a t rule with m − 1 degrees of freedom. We
are concerned with intermediate cases, and may expect that the
distribution will resemble a t rule with more than m − 1 degrees of
freedom. To estimate the number we form the corresponding derivatives
for the normal law. Here we have


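$$ \log L = -n\log\sigma - \frac{n}{2\sigma^2}\{(\bar x-\lambda)^2 + s'^2\}, \tag{7} $$

$$ \frac{\partial}{\partial\lambda}\log L = \frac{n}{\sigma^2}(\bar x-\lambda), \tag{8} $$

$$ \frac{\partial}{\partial\sigma^2}\log L = -\frac{n}{2\sigma^2} + \frac{n}{2\sigma^4}\{s'^2 + (\bar x-\lambda)^2\}, \tag{9} $$

$$ \frac{\partial^2}{\partial\lambda^2}\log L = -\frac{n}{\sigma^2}, \tag{10} $$

$$ \frac{\partial^2}{(\partial\sigma^2)^2}\log L = \frac{n}{2\sigma^4} - \frac{n}{\sigma^6}\{s'^2 + (\bar x-\lambda)^2\}. \tag{11} $$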

(8) and (9) vanish when λ = x̄, σ = s′; and then (10) becomes −n/s′²
and (11) becomes −n/2s′⁴. Hence, to the second order in departures
from the maximum likelihood solution,

$$ \log L = \text{constant} - \frac{n}{2s'^2}(\bar x-\lambda)^2 - \frac{n}{4s'^4}(\sigma^2-s'^2)^2. $$

But it is simply the uncertainty of σ that produces the departure of the
t rule from the normal. Consider then the value −A taken by (10)
when σ² = s′², and the value −B taken when σ² = s′²{1 + (2/n)^{1/2}},
that is, when σ² is increased by its standard error. We have

$$ n = \frac{2}{(A/B-1)^2}. $$

Then the number of degrees of freedom is n − 1; and the s² of the t rule
is given by

$$ s^2 = \frac{s'^2}{n-1} = \frac{n}{(n-1)A}. $$

This can be immediately adapted to (5) and (6). We work out (6) for
the maximum likelihood solution. (5) for this solution is −A; (5) with
τ² increased by its standard error indicated by (6) is −B. An approximate
t rule for λ follows.
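
The rule just described can be put in a few lines. The sketch below assumes the reading n = 2/(A/B − 1)² and s² = n/{(n − 1)A}, where −A is the second derivative with respect to λ at the maximum likelihood solution and −B is its value when the variance parameter is raised by its standard error. It then checks the rule on the normal law itself, with hypothetical values of n and s′²; the function name is ours.

```python
import math

def t_rule_from_curvatures(A, B):
    """Effective n, degrees of freedom and s^2 of the approximating t rule."""
    n = 2.0 / (A / B - 1.0) ** 2
    dof = n - 1.0
    s2 = n / ((n - 1.0) * A)
    return n, dof, s2

# Consistency check on the normal law: A = n/s'^2 and B is the same quantity
# with sigma^2 raised by its standard error s'^2 * sqrt(2/n).
n_true, s_prime2 = 12, 2.5          # hypothetical values
A = n_true / s_prime2
B = n_true / (s_prime2 * (1.0 + math.sqrt(2.0 / n_true)))
n, dof, s2 = t_rule_from_curvatures(A, B)
print(n, dof, s2, s_prime2 / (n_true - 1))
```

For the normal law the rule returns the original n and the usual s² = s′²/(n − 1), which is what makes it a reasonable guide in the intermediate case.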

The following data on the correction to the constant of nutation,
derived from a combination of data by Sir H. Spencer Jones,† provide
an illustration. The separate equations of condition are from comparisons
of different pairs of stars. The unit is taken as 0.01″; the standard
error for unit weight derived from internal comparisons is 7.7. The
weights have been rounded to the nearest unit.


     k_r       x_r      k_r(x_r − x̄)²
      44     −2.02           326
      25     +3.62           200
      23     +4.17           293
      25     +0.11             8
       8     −1.73            47
       6     +4.89            90
       3     +4.28            39
      18     −0.82            41
                            ----
                            1043


The weighted mean is +0.69 and gives χ² = 1043/7.7² = 16.9 on 7
degrees of freedom. This is beyond the 2 per cent point, and is enough
to arouse suspicion. The original series, before they were combined to
give the above estimates, had shown similar discrepancies, one of them
being beyond the 0.1 per cent point. There is further confirmation
from the distribution of the contributions to χ². For random variation
these should not be correlated with k_r. Actually the three largest
contributions come from three of the four largest k_r, which is what we
should expect if intraclass correlation is present. We therefore proceed
to estimate τ².

† M.N.R.A.S. 98, 1938, 440-7.

To get an upper estimate we treat all the values as of equal weight,
thus neglecting σ². The simple mean is +1.55 (which is a warning that
if τ is not taken into account there may be a serious error in the
estimation of λ) and the residuals give τ² = 8.8. This is too high since the
variation includes the part due to σ.

We write w_r = k_r/(σ² + k_r τ²).

For purposes of computation λ is taken as λ₀ = +1.13 (suggested by
the first trial value τ² = 6.0), and w_r is worked out for several trial
values of τ². Results are as follows.



By (4) we have to interpolate so that Σ w_r − Σ w_r²(x_r − λ)² = 0. We
can neglect the difference between λ and λ₀. Interpolation gives

    τ² = 3.71,

and the interpolated value of λ − λ₀ is −0.075, hence

    λ = +1.055.

Also

$$ -\frac{\partial^2}{(\partial\tau^2)^2}\log L = -\tfrac12\sum_r w_r^2 + \sum_r w_r^3(x_r-\lambda)^2 = +0.091. $$

Then we can take τ² = 3.71 ± 3.32. Substitute in Σ w_r for τ² = 3.71 and
6.0; we get respectively 1.02 and 0.77. Extrapolating to τ² = 7.03
we have Σ w_r = 0.66,

$$ n = \frac{2}{(1.02/0.66-1)^2} = 7, \qquad s^2 = \frac{7}{6\times 1.02}. $$

Changing the unit to 1″ we have the solution

    λ = +0.0105″ ± 0.0107″, 6 d.f.,
    τ = 0.0193″ ± 0.0073″.
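
The last steps of this example can be retraced numerically. The sketch below takes w_r = k_r/(σ² + k_r τ²) with the rounded weights of the table and σ = 7.7, and then applies the rule n = 2/(A/B − 1)², s² = n/{(n − 1)A} to the quoted values A = 1.02 and B = 0.66. These readings of the formulae, and the function names, are our assumptions. With the rounded weights the direct sums come out near 1.03 and 0.78 rather than exactly the 1.02 and 0.77 quoted, presumably because the original computation used unrounded values.

```python
import math

k = [44, 25, 23, 25, 8, 6, 3, 18]     # rounded weights k_r from the table
sigma2 = 7.7 ** 2                      # standard error for unit weight, squared

def sum_w(tau2):
    """Sum of w_r = k_r / (sigma^2 + k_r * tau^2)."""
    return sum(kr / (sigma2 + kr * tau2) for kr in k)

def linear_extrapolate(x0, y0, x1, y1, x):
    """Straight line through (x0, y0) and (x1, y1), evaluated at x."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

direct = [sum_w(3.71), sum_w(6.0)]                     # compare with 1.02, 0.77
B_extrapolated = linear_extrapolate(3.71, 1.02, 6.0, 0.77, 7.03)

A, B = 1.02, 0.66                                      # values quoted in the text
n = 2.0 / (A / B - 1.0) ** 2
n_rounded = round(n)
s = math.sqrt(n_rounded / ((n_rounded - 1) * A))       # in units of 0.01 arc-second

print(direct, B_extrapolated, n, s)
```

The extrapolated value is close to 0.66, n comes out near 7, and s is about 1.07 in the unit of 0.01″, in line with the quoted uncertainty of λ on 6 degrees of freedom.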

This solution is given only as an illustration of the method. A discussion
using expectations gave similar conclusions,† but led Spencer Jones to
go more into detail. He discovered a systematic effect that had been
overlooked, and on allowing for it he obtained a satisfactory agreement
with the hypothesis of independence of the errors, and consequently a
substantial increase in accuracy.‡ His result was

    λ = +0.0034″ ± 0.0062″.

The question of a significance test for τ will arise in such problems.
We notice that on the hypothesis τ = 0 a mean has a standard error
σ/√k_r, and for other τ one of √(σ²/k_r + τ²). Hence, for small τ², J will be
of the order of magnitude of τ⁴, not τ². In applying the approximate
form for K we should therefore take

K 4= (Y)"“exp{-ir74}. 

as suggested by 5.31(5); a factor ½ is needed because τ² cannot be
negative.

The determination of the constant of gravitation provides an illustra- 
tion of the danger of drastic rejection of observations and of the method 
of combining estimates when the variation is not wholly random. 
C. V. Boys gave the value 6.658 × 10⁻⁸ c.g.s. But P. R. Heyl,§ quoting
Boys’s separate values, points out that the simple mean, apart from
the factor 10⁻⁸, is 6.663. There were nine determinations, of which all
but two were rejected, so that the final result was the mean of only 
two observations with an unknown standard error. Even if these had 
been the only two observations the uncertainty of the standard error 
would have a pronounced effect on the posterior probability distribu- 
tion; but when they are selected out of nine the accuracy is practically 
impossible to assess. Heyl made three sets of determinations, using 
balls of gold, platinum, and optical glass respectively. The summaries 
are as follows, with the estimated standard errors. 

                                      n
Boys            6.663 ± 0.0023        9
Heyl
  Gold          6.678 ± 0.0016        6
  Platinum      6.664 ± 0.0013        6
  Glass         6.674 ± 0.0027        5

† M.N.R.A.S. 99, 1939, 206-10.
‡ Ibid., pp. 211-16.
§ Bur. Standards Res. J. 5, 1930, 1243-90.

The estimates are plainly discrepant. Heyl has tested the possibility
of a real difference between the constant of gravitation for different
substances by means of the Eötvös balance and finds none; and there
is no apparent explanation of the differences. They are so large that
we may compute the simple mean at once; it is 6.670, and the sum of
squares of the residuals is 165 × 10⁻⁶, of which the known uncertainties
account for 17 × 10⁻⁶. The standard error of an entire series can then
be taken as (148/3)^{1/2} × 10⁻³ = 0.0070. Combining this with the known
uncertainties we get for the respective σ² 10⁻⁶(54, 52, 51, 56). An improved
value could be got by computing a revised mean with the
reciprocals of these as weights, but they are so nearly equal that the
simple mean will be reproduced. The standard error can then be
taken as 3.7 × 10⁻³, and the result is 10⁻⁸(6.670 ± 0.0037). The result is,
however, virtually based on only three degrees of freedom; the
root-mean-square estimate of uncertainty would be 6.4 × 10⁻³, and this
would be the safest to use in matters where the chief uncertainty
arises from the constant of gravitation.
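
The arithmetic of this combination can be retraced as follows. The sketch uses the four summary values (apart from the factor 10⁻⁸), the quoted allowance of 17 × 10⁻⁶ for the known uncertainties, and the quoted combined variances. The expressions used for the standard error of the mean, and for the root-mean-square uncertainty as that standard error multiplied by √{ν/(ν − 2)} for a t rule on ν = 3 degrees of freedom, are our assumptions about how the quoted 3.7 and 6.4 arise.

```python
import math

estimates = [6.663, 6.678, 6.664, 6.674]    # Boys; Heyl gold, platinum, glass
n = len(estimates)

mean = sum(estimates) / n
ss = sum((v - mean) ** 2 for v in estimates)        # sum of squares of residuals

known = 17e-6                                       # part due to known uncertainties
series_se = math.sqrt((ss - known) / (n - 1))       # standard error of an entire series

combined_var = [54e-6, 52e-6, 51e-6, 56e-6]         # quoted combined variances
weights = [1.0 / v for v in combined_var]
weighted_mean = sum(w * v for w, v in zip(weights, estimates)) / sum(weights)

se_mean = math.sqrt(ss / (n * (n - 1)))             # assumed form of the standard error
dof = n - 1
rms = se_mean * math.sqrt(dof / (dof - 2))          # r.m.s. of a t rule on 3 d.f.

print(round(mean, 4), round(ss * 1e6, 1), round(series_se, 4))
print(round(weighted_mean, 4), round(se_mean * 1e3, 2), round(rms * 1e3, 2))
```

The weighted mean differs from the simple mean only in the fourth decimal, as the text anticipates, and the two measures of uncertainty come out near 3.7 × 10⁻³ and 6.4 × 10⁻³.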

5.63. Suspiciously close agreement. The tendency of either internal
correlation or of a neglected systematic effect is in general to increase
χ² or z, and it is chiefly to this fact that these functions owe their
importance. If they agree reasonably with their expectations the null
hypothesis can usually be accepted without further ado. But it sometimes
happens that χ² is much less than its expectation; an analogous
result would be strongly negative z when the variation suspected of
containing a systematic part is compared with the estimate of error;
another is when the standard error of a series of measures is much less
than known sources of uncertainty suggest. Strong opinions are expressed
on this sort of agreement. Thus Yule and Kendall remark:†

‘Nor do only small values of P (the probability of getting a larger χ² by accident)
lead us to suspect our hypothesis or our sampling technique. A value of P very
near to unity may also do so. This rather surprising result arises in this way: a
large value of P normally corresponds to a small value of χ², that is to say a very
close agreement between theory and fact. Now such agreements are rare,
almost as rare as great divergences. We are just as unlikely to get very good
correspondence between fact and theory as we are to get very bad correspondence
and, for precisely the same reasons, we must suspect our sampling technique if
we do. In short, very close agreement is too good to be true.


† Introduction to the Theory of Statistics, p. 423.





‘The student who feels some hesitation about this statement may like to
reassure himself with the following example. An investigator says that he threw
a die 600 times and got exactly 100 of each number from 1 to 6. This is the
theoretical expectation, χ² = 0 and P = 1, but should we believe him? We might,
if we knew him very well, but we should probably regard him as somewhat lucky,
which is only another way of saying that he has brought off a very improbable
event.’†
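
The two improbabilities mentioned here, the exactly even result with the die and the complete suit at bridge referred to in the footnote, are easily put in figures. The sketch below assumes a fair die with independent throws and a random deal; the function name is ours.

```python
import math

def log_multinomial_prob(counts, probs):
    """Logarithm of the multinomial probability of the given counts."""
    n = sum(counts)
    out = math.lgamma(n + 1)
    for c, p in zip(counts, probs):
        out += c * math.log(p) - math.lgamma(c + 1)
    return out

# Exactly 100 of each face in 600 throws of a fair die.
p_die = math.exp(log_multinomial_prob([100] * 6, [1.0 / 6.0] * 6))

# A named player holding all 13 cards of one suit (any of the 4 suits).
p_suit = 4 / math.comb(52, 13)

print(p_die, p_suit)
```

The first chance is of the order of a few parts in ten million; the second is smaller still by several powers of ten.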

Similarly, Fisher writes‡

‘If P is between 0.1 and 0.9 there is certainly no need to suspect the hypothesis
tested. . . .’

‘The term Goodness of Fit has caused some to fall into the fallacy of believing
that the higher the value of P the more satisfactorily is the hypothesis verified.
Values over 0.999 have been reported, which, if the hypothesis were true, would
only occur once in a thousand trials. Generally such cases are demonstrably due
to the use of inaccurate formulae, but occasionally small values of χ² beyond the
expected range do occur, ... In these cases the hypothesis is as definitely
disproved as if P had been 0.001.’

A striking case is given by Fisher§ himself in a discussion of the
data in Mendel’s classical papers on inheritance. In every case the
data agreed with the theoretical ratios within less than the standard
errors; taking the whole together, χ² was 41.6 on 84 degrees of freedom,
and the chance of a smaller value arising accidentally is 0.00007.
The test originated in two cases where Mendel had distinguished the
pure and heterozygous dominants by self-fertilization, growing ten of
the next generation from each. Since the chance of a self-fertilized
heterozygote giving a dominant is ¾, the chance that all ten would be
dominants is (0.75)¹⁰ = 0.05, so that about 5 per cent of the heterozygous
ones would fail to be detected, and the numbers would be
underestimated. Correcting for this, Fisher found that Mendel’s observed
numbers agreed too closely with the uncorrected ratio of one
pure to two mixed dominants, while they showed a serious discrepancy
from the corrected ratio. Fisher suggests that an enthusiastic assistant,
knowing only too well what Mendel expected, made the numbers agree
with his expectations more closely than they need, even in a case where
Mendel had overlooked a complication that would lead the theoretical
ratio to differ appreciably from the simple 1 : 2.

† To go to the other extreme, if a man reports that he obtained a complete hand of
one suit at bridge we do not believe that he did so by a random deal. It is more likely
either that he is lying or that something was wrong with the shuffling.
‡ Statistical Methods, 1936, p. 84.
§ Annals of Science 1, 1936, 115-37.

When there is only one degree of freedom to be tested a very close
agreement is not remarkable; if two sets of measures refer to the same
thing, agreement between the estimates within the rounding-off error
is the most probable result, even though its probability is of the order
of the ratio of the rounding-off error to the standard error of the
difference. It is only when such agreements are found persistently that
there is ground for suspicion. The probable values of χ² from 84 degrees
of freedom are 84 ± 13, not 0. If the only variations from the null
hypothesis were of the types we have discussed here, too small a χ²
would always be evidence against them. Unfortunately there is another
type. By some tendency to naive notions of causality, apparent
discrepancies from theory are readily reduced in the presentation of
the data. People not trained in statistical methods tend to underestimate
the departures that can occur by chance; a purely random
result is in consequence often accepted as systematic when no significance
test would accept it as such, and ‘effects’ make transitory
appearances in the scientific journals until other workers repeat the
experiments or estimate the uncertainty properly. Similarly, when
the investigator believes in a theory he is predisposed to think that if
a set of observations differs appreciably from expectation there is something
wrong with the observations, even though a closer examination
would show that the difference is no larger than would often occur by
chance; and the consequence is that observations may be rejected or
illegitimately modified before presentation. This tendency is the more
dangerous because it may be completely unconscious. In Mendel’s
experiments, where there were theoretical ratios to serve as a standard,
the result would be too small a χ², which is what Fisher found.
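
The range quoted for 84 degrees of freedom follows from the fact that χ² on ν degrees of freedom has expectation ν and standard error √(2ν). A minimal sketch, with names of our own choosing, which also places the value 41.6 found for Mendel's data on that scale:

```python
import math

def chi2_mean_and_sd(dof):
    """Expectation and standard error of chi-square on `dof` degrees of freedom."""
    return dof, math.sqrt(2.0 * dof)

mean, sd = chi2_mean_and_sd(84)
deviation = (41.6 - mean) / sd    # observed value in units of the standard error

print(mean, round(sd, 1), round(deviation, 2))
```

The observed value lies more than three times the standard error below its expectation.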

A significance test for such cases on the lines of the present chapter
has not been constructed. It would be most useful if the prior probability
took account of previous information on human mendacity, but
this has not, I think, been collected in a useful form!

5.64. Sir Arthur Eddington has claimed to have deduced theoretical
values of many measurable physical quantities from purely epistemological
considerations. I consider that this is at least partly because he
has incorporated a great deal of observational material into what he
calls epistemology;† but that is not the chief reason why the great
majority of physicists hesitate to accept his arguments. At any rate it
is interesting to compare the values deduced theoretically in his Fundamental
Theory with observation. He takes the velocity of light, the
Rydberg constant, and the Faraday constant as fundamental and
calculates the rest from them. I give his comparisons as they stand
except for powers of 10, which are irrelevant for the present purpose;
uncertainties are given as ‘probable errors’ and the factor (0.6745)²
must be applied at some stage in the computation of χ². Probable errors
are given for the last figure in the observed value.

† Phil. Mag. (7), 32, 1941, 177-205.



                           Obs.        P.E.   Calc.         O.−C.   (0.6745)⁻²χ²
e/m_e c (deflexion)        1.75959     24     1.75953       +6        0.1
e/m_e c (spectroscopic)    1.75934     28     1.75953       −19       0.5
hc/2πe²                    137.009     16     137.000       +9        0.3
m_p/m_e                    1836.27     56     1836.34       −7        0.0
M                          1.67339     31     1.67368       −29       0.9
m_e                        9.1066      22     9.1092        −26       1.4
e′                         4.8025      10     4.8033        −8        0.6
h′                         6.6242      24     6.6250        −8        0.1
file'                      1.3800       5     1.3797        +3        0.4
K                          6.670        5     6.6665        +3.6      0.5
n′ − H′                    0.00082      3     0.0008236     −0.4      0.0
2H′ − D′                   0.001539     2     0.0015404     −1.4      0.5
4H − He                    0.02866      7     0.02862 ± 4   +4       <1.0
SW                         2.7896       8     2.7899        −3        0.1
3Jl                        1.935       20     1.9371        −2.1      0.0
                                                                     <6.4


I have omitted some of Eddington’s comparisons but retained, I think,
all where the observed values rest on independent experiments. The
result is that χ² is not more than 2.9, on 15 d.f. This is preposterous;
the 99 per cent point is at χ² = 5.2.
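
Readers without tables can check the position of these values on the χ² distribution for 15 degrees of freedom. The sketch below evaluates the chance of a smaller χ² from the standard series for the incomplete gamma function; it is an illustration with names of our own, not the method used in the text.

```python
import math

def chi2_cdf(x, dof):
    """Chance that chi-square on `dof` degrees of freedom is less than x."""
    if x <= 0.0:
        return 0.0
    a, h = dof / 2.0, x / 2.0
    # Series: gamma(a, h) = h^a e^(-h) * sum_{n>=0} h^n / (a (a+1) ... (a+n))
    term = 1.0 / a
    total = term
    n = 0
    while term > 1e-16 * total:
        n += 1
        term *= h / (a + n)
        total += term
    return total * math.exp(a * math.log(h) - h - math.lgamma(a))

print(chi2_cdf(5.2, 15))   # chance of a chi-square below the quoted 99 per cent point
print(chi2_cdf(2.9, 15))   # chance of a chi-square as small as that found here
```

The first chance comes out close to 1 in 100, as the quoted point implies, and the second is well under 1 in 1,000.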

It might theoretically be better not to take three constants as
definitely known, but to make a least-squares solution from 18 data,
taking these as unknown, using their experimental uncertainties. This
would not make much difference since they are among those whose
uncertainties are smallest compared with the adopted values; the only
difference would be that χ² would be slightly reduced, remaining on
15 d.f.

Many of the observed values are based on very few degrees of freedom;
K, the constant of gravitation, for instance, is on 3 d.f. In these conditions
the use of χ² as if the errors were normally distributed is seriously
wrong (cf. 2.82); but the tendency of the allowance for small numbers
of degrees of freedom would be to increase the expectation of χ², and
a more accurate test would give a larger predicted χ². Thus correction
of either of the obvious statistical blemishes would increase the discrepancy;
and the observations agree with Eddington’s theory far better
than they have any business to do if that theory is right.

There are two possible explanations. The one that would occur to 
many physicists is that Eddington’s theory is artificial throughout, 
and that by skilful juggling with numbers he has produced a forced
agreement. This may be so, though I should not say that his theory is
at any point more artificial or less intelligible than any other statement 
of quantum theory. All need a complete restatement of their relations 
to experience, including a statement of what features in experience 
demand the kind of analysis that has been adopted. 

The other concerns the ‘probable errors’ of the observed values. 
Many of these are not based on a statistical discussion, but include 
an allowance for possible systematic errors, of the kind that is deprecated
in 5.61. It is quite possible that the probable errors given are
systematically two or three times what a proper statistical discussion 
would give. In particular, some of the estimates are the results of 
combining several different determinations, alleged to be discrepant, 
but as the number of degrees of freedom of the separate determinations 
is never given, it is impossible to form a judgement on the existence of 
these discrepancies without working through the whole of the original 
data afresh. If the uncertainties had not been artificially inflated it is 
possible that a normal χ² would have been found. At any rate the first
suggested explanation cannot be accepted until the second is excluded 
by a rediscussion of the experimental data. 

5.65. In counting experiments the standard error is fixed by the
numbers of the counts alone, subject to the condition of independence.
In measurement the matter is more complicated, since observers like
their standard error to be small, and it is one of the unknowns of the
problem and has to be judged only from the amounts of the residuals.
But actually the standard error of one observation is not often of much
further interest in estimation problems; what matters most, both in
estimation problems and in any significance test that may supervene,
is the standard error of the estimates. Now it is easy in some types of
investigation for an apparent reduction of the standard error of one
observation to be associated with no reduction at all in the accuracy
of the estimates. This can be illustrated by the following example.
A set of dice were thrown, sixes being rejected, and 3 was subtracted
from each result. Thus a set of numbers −2 to +2, arranged at random,
was obtained (series A). Differences to order 4 were found, and two
smoothed sets of values B and C were obtained, one by adding ¼ of
the second difference, one by subtracting 1/12 of the fourth difference.
The unsmoothed and the two smoothed series are shown below. The
respective sums of the 44 squares, excluding for the series A the two
unsmoothed values at each end, are 88,† 18.9, and 29.7. The smoothing
has produced a great reduction in the general magnitude of the residuals;
judging by this alone the standard errors have been multiplied
by 0.46 and 0.58 by the two methods. But actually, if we want a summary
based on the means of more than about 5 consecutive values we
have gained no accuracy at all. For if a group of successive entries in
column A are x₋₂, x₋₁, x₀, x₁, x₂, method B will make x₀ contribute
¼x₀ to the second and fourth entries and ½x₀ to the third; the contribution
from x₀ to the sum of the five remains x₀. Method C will make
x₀ contribute −(1/12)x₀ to the first and fifth entries, ⅓x₀ to the second and
fourth, and ½x₀ to the third. Again there is no change in the sum of
the five. There is a little gain through the contributions from the
entries for adjacent ranges, but the longer the ranges are the smaller
this will be.

† This agrees exactly with expectation!

    A      B      C         A      B      C         A      B      C
    0                      +2   +1.0   +0.8        −2   −1.0   −1.8
   +2                       0   +1.0   +1.5        −2   −1.0   −1.7
   −2   −0.8   −0.8        +2   +0.5   +0.2         0   −0.5   −0.4
   −1   −0.5   −0.5        −2   −0.2   −0.8         0   −0.2   −0.2
   +2   +0.2   +0.3        −1   −0.5   −0.7        −1   −0.2   −0.1
   −2   −1.0   −1.1        +2   +0.8   +0.8        +1    0.0   −0.2
   −2   −1.0   −1.2         0   +0.8   +1.2        −1    0.0   +0.2
   +2   +0.5   +0.7        +1    0.0   −0.4        +1   +0.2   +0.1
    0    0.0   +0.2        −2   −0.5   −0.4         0   +0.2   +0.5
   −2   −1.0   −1.2        +1   +0.2    0.0         0   −0.2   −0.2
    0   −0.2   −0.2        +1   +1.0   +1.5        −1   −1.0   −0.8
   +1   +0.2   +0.5        +1   +0.2   +0.2        −2   −1.0   −0.8
   −1   −0.8   −0.9        −2   −1.0   −1.2        +1   −0.5   −0.8
   −2   −1.0   −1.1        −1   −0.5   −0.6        −2   −0.8   −0.3
   +1    0.0   −0.2        +2   +0.8   +1.0         0
    0   +0.8   +1.2         0    0.0   +0.2        −2
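
The argument about the contributions of x₀ can be verified mechanically. The sketch below writes the two smoothings as weighted sums, taking method B as the weights ¼, ½, ¼ and method C as −1/12, ⅓, ½, ⅓, −1/12 (our reading of the two rules), and checks that each spreads a single entry over its neighbours without changing the total. It also recomputes the sum of squares for series A, with the values as we read them down the three column groups of the table. All names are ours.

```python
from fractions import Fraction as F

KERNEL_B = [F(1, 4), F(1, 2), F(1, 4)]                       # add 1/4 of 2nd difference
KERNEL_C = [F(-1, 12), F(1, 3), F(1, 2), F(1, 3), F(-1, 12)]  # subtract 1/12 of 4th

def smooth(series, kernel):
    """Weighted moving sum; the ends that cannot be smoothed are dropped."""
    half = len(kernel) // 2
    return [sum(w * series[i + j - half] for j, w in enumerate(kernel))
            for i in range(half, len(series) - half)]

# A single unit entry x0 among zeros: where does it go after smoothing?
impulse = [0] * 4 + [1] + [0] * 4
spread_b = smooth(impulse, KERNEL_B)
spread_c = smooth(impulse, KERNEL_C)

# Series A as read from the table (three groups of sixteen).
SERIES_A = [0, 2, -2, -1, 2, -2, -2, 2, 0, -2, 0, 1, -1, -2, 1, 0,
            2, 0, 2, -2, -1, 2, 0, 1, -2, 1, 1, 1, -2, -1, 2, 0,
            -2, -2, 0, 0, -1, 1, -1, 1, 0, 0, -1, -2, 1, -2, 0, -2]
inner = SERIES_A[2:-2]                                   # drop two at each end
ss_a = sum(v * v for v in inner)
expected_ss_a = len(inner) * F(sum(v * v for v in range(-2, 3)), 5)

# Expected reduction of the variance of one entry for independent errors.
var_factor_b = sum(w * w for w in KERNEL_B)
var_factor_c = sum(w * w for w in KERNEL_C)

print(sum(spread_b), sum(spread_c), ss_a, expected_ss_a, var_factor_b, var_factor_c)
```

For independent errors the expected factors on the standard error of a single entry are √(3/8) and √(35/72), about 0.61 and 0.70; the reductions found in this particular sample are of that order, while the total contributed by each original entry is unchanged.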

Now it might well happen that we have a series of observations of
what should be a linear function of an independent variable, and that
the above set of values A are the errors rounded to a unit.† The least-squares
solution based on the hypothesis of the independence of the
errors will be valid. If a smoothing process changes the errors to B or
C the solution will be the same; but if the errors are still supposed
independent the apparent accuracy will be much too high, because we
know that the correct uncertainty is given by A. What the smoothing
as in B does, if the error at one value is x₀, independent of adjacent
values, is to make component errors ¼x₀, ½x₀, ¼x₀ at adjacent values.
Thus, though the smoothing somewhat improves the individual values,
it does so by introducing a correlation between consecutive errors; and if
the errors are given by B or C this departure from independence of the
errors is responsible for a diminished real accuracy in comparison with
the apparent accuracy obtained on the hypothesis of independence.

† The process actually used gets them from a rectangular and not a normal distribution
of chance, but this is irrelevant here.
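
The size of the correlation introduced can be worked out directly for smoothing of type B. The sketch below assumes independent original errors of unit variance and weights ¼, ½, ¼; the function name is ours.

```python
from fractions import Fraction as F

def autocovariance(kernel, lag):
    """Covariance between smoothed errors `lag` steps apart, for unit-variance input."""
    return sum(kernel[i] * kernel[i + lag] for i in range(len(kernel) - lag))

kernel_b = [F(1, 4), F(1, 2), F(1, 4)]

variance = autocovariance(kernel_b, 0)      # variance of one smoothed error
cov1 = autocovariance(kernel_b, 1)
cov2 = autocovariance(kernel_b, 2)
rho1 = cov1 / variance                      # correlation between neighbours
rho2 = cov2 / variance                      # correlation two steps apart

# Variance per entry that governs the mean of a long run of smoothed values.
long_run_variance = variance + 2 * (cov1 + cov2)

print(variance, rho1, rho2, long_run_variance)
```

The variance of a single smoothed error falls to 3/8, but neighbouring errors are correlated to the extent of 2/3, and the variance that governs the mean of a long run is still 1. Treating the smoothed errors as independent would therefore understate the uncertainty of such a mean.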

Now at the best the hypothesis of independence of the errors needs 
a check when suitable information becomes available; it is never certain. 
But it does often survive a test, and the estimate of uncertainty is then 
valid. If there is any possibiUty that it is true, that possibility should 
not be sacrificed. There is a real danger in some types of observation 
that spurious accuracy may be obtained by introducing a correlation 
between neighbouring errors. In seismological work, for instance, a 
careful observer may read his records again and again to make ‘sure’, 
working out his residuals after each set of readings; and in these condi- 
tions it is practically impossible for him to avoid letting his readings 
on one record be influenced by those at neighbouring distances. There
is a further danger of accidental close agreement in the results for a 
few separate series; knowledge of the standard error of each series based
on the hypothesis of independence prevents too high an accuracy from
being asserted in such cases.

In some cases a lack of independence arising in this way can be 
detected by comparing determinations from different series of observations;
too large a $\chi^2$ will be found, and then the differences between
the series provide a valid estimate of uncertainty, though based on
fewer degrees of freedom than might have been available in the first 
place. But even here it may happen that previous results are used to 
reject observations, and then even this independence fails. If the pos- 
sibility of this check is to be preserved, every series must be reduced 
independently. Otherwise a mistake made at the outset may never be 
found out. 

5.7. Test of the normal law of error. Actual distributions of errors 
of observation usually follow the normal law sufficiently closely to make 
departures from it hard to detect with fewer than about 500 observa- 
tions. Unfortunately this does not show that the treatment appropriate 
to the normal law is appropriate also to the actual law; the same is 
true for a binomial law with only three or four components, or for a 
triangular law, and for these the extreme observations have an im- 
portance in estimation that far exceeds any they can have on the normal 
law. (The binomial would of course have to be compared with a normal 
law with the chances grouped at equal intervals.) Many series of 
observations have been published as supporting the normal law.




Pearson showed in his original paper that some of these showed such 
departures from the normal law as would warrant its rejection. I have 
myself analysed nine series for this purpose.† Six of these are from a
paper by Pearson, which has already been mentioned (p. 270). W. N. 
Bond made a series of about 1,000 readings of the position of an illumi- 
nated slit, viewed with a travelling microscope slightly out of focus. 
The slit was kept fixed, but the microscope was moved well outside the 
range of vision after each reading, so that the errors would be as far 
as possible independent. The conditions resemble the measurement of 
a spectrum line or, apart from the shape of the object, that of a star 
image on a photographic plate. Later Dr. H. R. Hulme provided me 
with two long series of residuals obtained in the analysis of the variation 
of latitude observations at Greenwich. These have the special interest 
that they are based on observations really intended to measure some- 
thing and not simply to test the normal law; but Pearson’s were 
primarily designed to test the hypothesis that the error of observation 
could be regarded as the sum of a constant personal error and a random 
error, the test of the normal law being a secondary feature. So many 
lists of residuals exist that could be compared with the normal law that 
published comparisons are under some suspicion of having been selected 
on account of specially good agreement with it. 

In comparison with the normal law, Type VII gives $J$ infinite for …;
Type II gives $J$ infinite for any $m$, but we can modify the
definition by omitting the intervals where the probability according to
Type II is zero, and then $J$ remains finite, tending to infinity only as
$m \to 1$. It is sufficient for our purposes to use the approximate formula
of 5.31. The maximum likelihood solutions for $\mu$, which is $1/m$ for Type
VII and $-1/m$ for Type II, are as follows.




Series                  n       μ                  K
Pearson: Bisection 1    500     +0.116 ± 0.037     0.31
                   2    500     +0.04  ± 0.04      17
                   3    500     -0.225 ± 0.057     0.0116
Bright line        1    519     +0.230 ± 0.057     0.0083
                   2    519     +0.163 ± 0.050     0.140
                   3    519     -0.080 ± 0.049     7.5
Bond                    1026    +0.123 ± 0.051     2.2
Greenwich          1    4640    +0.369 ± 0.020     10^{-7?}
                   2    5014    +0.443 ± 0.018     10^{-1??}

Six of the nine series give $K$ less than 1, three less than 0.01. Allowance
† Phil. Trans. A, 237, 1938, 231-71; M.N.R.A.S. 99, 1939, 703-9.





for selection as in 6.04 does not alter this, but the larger values of $K$
are, of course, reduced. But there is another check. If the errors, apart
from a constant personal error, were random and followed the normal
law, the means of groups of 25 consecutive observations should be
derived from a normal law, with standard error 1/5 of that of the whole
series. If $\gamma^2$ is the square of the observed ratio, it should be about 0.04.
In every case the actual value in Pearson's series was higher; it actually
ranged from 0.066 to 0.550. The test for comparison of two standard
errors, with $\nu_1 = 20$, $\nu_2 = 480$, will obviously give $K$ much less than
1 in every case. One apparently possible explanation would be that if
errors follow a Type VII law, even if they are independent, means
of a finite number of observations will fluctuate more than on the
normal law. If this was the right explanation $\gamma$ should increase with $\mu$.
The actual variation is in the other direction. Taking the values in
order of decreasing $\mu$, we have the following table.


Series              μ         γ²       r
Bright line  1      +0.230    0.066    0.16
Bright line  2      +0.163    0.100    0.24
Bisection    1      +0.116    0.093    0.23
Bisection    2      +0.04     0.36     0.57
Bright line  3      -0.080    0.140    0.32
Bisection    3      -0.225    0.550    0.72


$r$ is defined as $\sqrt{\gamma^2 - 0.04}$ and is an estimate of the fraction of the
standard error that persists through 25 observations. There is a correla-
tion of -0.92 between $\mu$ and $r$, which might represent a practically
perfect correlation since both $\mu$ and $r$ have appreciable uncertainties.
If we fit a linear form by least squares, treating all determinations as
of equal weight, we get

$$\mu = +0.273 \pm 0.093 - (0.62 \pm 0.22)\,r.$$

The suggestion of these results is therefore that reduction in $\mu$ is
strongly associated with increase in the correlation between consecu-
tive errors, and that a set of really independent errors, if there is
such a thing, would satisfy a Type VII law with $m$ probably between
2.7 and 5.5.
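The straight-line fit can be reproduced from the six tabulated pairs. The following is a minimal sketch using ordinary equal-weight least squares on the rounded values of $\mu$ and $r$ read from the table; the helper name `fit_line` is hypothetical, and small differences from the printed figures are to be expected from rounding.

```python
import math

# Rounded (mu, r) pairs read from the table, in the same order.
mu = [0.230, 0.163, 0.116, 0.04, -0.080, -0.225]
r = [0.16, 0.24, 0.23, 0.57, 0.32, 0.72]


def fit_line(x, y):
    """Equal-weight least squares for y = intercept + slope * x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    # Residual variance on n - 2 degrees of freedom.
    s2 = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    se_slope = math.sqrt(s2 / sxx)
    se_intercept = math.sqrt(s2 * (1 / n + xbar ** 2 / sxx))
    return intercept, se_intercept, slope, se_slope


intercept, se_intercept, slope, se_slope = fit_line(r, mu)
print(f"mu = {intercept:+.3f} +/- {se_intercept:.3f} {slope:+.2f} +/- {se_slope:.2f} r")

# Since mu = 1/m for Type VII, the intercept (r = 0) gives limits for m.
m_low = 1 / (intercept + se_intercept)
m_high = 1 / (intercept - se_intercept)
print(f"m for independent errors: {m_low:.1f} to {m_high:.1f}")
```

The intercept is the value of $\mu$ extrapolated to $r = 0$, that is, to errors with no persistence, which is why its reciprocal is read as the index $m$ for truly independent errors.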

Bond's data would suggest limits for $m$, corresponding to the standard
error, of 5.7 to 14; the two Greenwich series of 2.6 to 2.9 and 2.2 to 2.4.
There appear to be real differences in the values of $m$, but this has an
obvious explanation. Pearson's and Bond's series were each made by



a single observer in conditions designed to be as uniform as possible. 
The Greenwich observations were made by several different observers 
in different conditions of observation. This would naturally lead to a 
variation of accuracy. But if several homogeneous series of different 
accuracy, even if derived from the normal law, were combined and the 
result analysed, we should get a positive $\mu$. The values found from
the Greenwich observations are therefore likely to be too high for
uniform observing conditions. It seems that for uniform conditions,
if independence of the errors can be attained, and if there is a single
value of $m$ suitable for such conditions, it is likely to be between 3
and 5. 
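The remark that combining normal series of different accuracy produces a positive $\mu$ can be illustrated through the fourth moment. This is a sketch only: it matches the ratio $\mu_4/\mu_2^2$ of a mixture of zero-mean normal laws to that of a Type VII law, $3(m-\tfrac32)/(m-\tfrac52)$, which is a moment-matching shortcut assumed here for illustration and not the maximum likelihood treatment used in the text. The standard deviations and fractions are made up.

```python
def mixture_kurtosis(sigmas, fractions):
    """mu_4 / mu_2^2 for a mixture of zero-mean normal laws."""
    mu2 = sum(p * s ** 2 for s, p in zip(sigmas, fractions))
    mu4 = sum(p * 3 * s ** 4 for s, p in zip(sigmas, fractions))
    return mu4 / mu2 ** 2


def matching_type7_index(beta2):
    """Index m of the Type VII law with the same mu_4 / mu_2^2 (needs beta2 > 3)."""
    return (5 * beta2 - 9) / (2 * (beta2 - 3))


homogeneous = mixture_kurtosis([1.0, 1.0], [0.5, 0.5])   # one accuracy only
mixed = mixture_kurtosis([1.0, 2.0], [0.5, 0.5])          # two accuracies combined
print(homogeneous, mixed, matching_type7_index(mixed))
```

A homogeneous series gives the normal value 3; mixing two accuracies raises the ratio above 3, which corresponds to a finite $m$ and hence to positive $\mu = 1/m$.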

Such a departure from the normal law is serious. We have seen that 
if $m < 2.5$ the usual rule for estimating the uncertainty of the standard
error breaks down altogether, and such values are not out of the
question. We have therefore two problems. First, since enormous
numbers of observations have been reduced assuming the normal law
(or different hypotheses that imply it), we need a means of reassessing
the accuracy of the summaries. Secondly, it is unusual for a set of
observations to be sufficiently numerous to give a useful determination
of $m$ by itself; but if we assume a general value of $m$ we can
frame a general rule for dealing with even short runs by maximum 
likelihood and accordingly making an approximate adjustment of the 
t rule. 


If we take 


A 



the uncertainty of the error term can be estimated roughly by using 
expectations. If $\mu_2$ and $\mu_4$ are the second and fourth moments of the
law, we have for Type VII

$$\beta_2 = \frac{\mu_4}{\mu_2^2} = 3\,\frac{m-\tfrac32}{m-\tfrac52},$$

which is 5 for $m = 4$, while it is 3 for $m$ infinite. Also


the variance of the mean-square deviation is

$$\frac{\mu_4}{n} - \frac{n-3}{n(n-1)}\,\mu_2^2.$$

For the normal law this is $2\mu_2^2/(n-1)$. For $m = 4$ it is nearly $4\mu_2^2/(n-1)$.
Hence if the mean and the mean-square deviation are used as estimates,
and $m = 4$, the probability of error will approximately follow a $t$ rule
with $\tfrac12(n-1)$ degrees of freedom instead of $n-1$.
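The halving of the degrees of freedom can be checked numerically. The sketch below assumes the standard sampling variance of the unbiased mean-square deviation, $\mu_4/n - (n-3)\mu_2^2/\{n(n-1)\}$, and defines an effective number of degrees of freedom as $2\mu_2^2$ divided by that variance, so that the normal law returns $n-1$. The function names are hypothetical.

```python
def type7_beta2(m):
    """mu_4 / mu_2^2 for a Type VII law of index m (finite only for m > 5/2)."""
    return 3 * (m - 1.5) / (m - 2.5)


def var_mean_square(n, beta2, mu2=1.0):
    """Sampling variance of the unbiased mean-square deviation from n values."""
    mu4 = beta2 * mu2 ** 2
    return mu4 / n - (n - 3) / (n * (n - 1)) * mu2 ** 2


def effective_dof(n, beta2):
    """Degrees of freedom of the normal-law rule with the same variance."""
    return 2.0 / var_mean_square(n, beta2)


n = 15
print(type7_beta2(4))                       # 5.0
print(effective_dof(n, 3.0))                # n - 1 for the normal law
print(effective_dof(n, type7_beta2(4)))     # close to (n - 1) / 2
```

For 15 observations the normal law gives 14 degrees of freedom and index 4 gives a little over 7, in line with the approximate rule of $\tfrac12(n-1)$.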

If we take $m = 4$ and estimate $\lambda$ and $\sigma$ by maximum likelihood, using
the equations 4.31 (10) and (11), it is convenient to have a table of the
quantity $w$ defined by

$$w^{-1} = 1 + \frac{(x-\lambda)^2}{2M\sigma^2}$$

as a function of $(x-\lambda)/\sigma$.


(x-λ)/σ    w        (x-λ)/σ    w        (x-λ)/σ    w
0.0        1.000    2.4        0.482    4.8        0.189
0.1        0.998    2.5        0.462    4.9        0.183
0.2        0.993    2.6        0.442    5.0        0.177
0.3        0.983    2.7        0.424    5.1        0.171
0.4        0.970    2.8        0.406    5.2        0.165
0.5        0.955    2.9        0.389    5.3        0.160
0.6        0.937    3.0        0.373    5.4        0.155
0.7        0.917    3.1        0.358    5.5        0.150
0.8        0.894    3.2        0.344    5.6        0.146
0.9        0.869    3.3        0.330    5.7        0.141
1.0        0.843    3.4        0.317    5.8        0.137
1.1        0.816    3.5        0.305    5.9        0.133
1.2        0.788    3.6        0.293    6.0        0.130
1.3        0.760    3.7        0.282    6.1        0.126
1.4        0.732    3.8        0.271    6.2        0.122
1.5        0.705    3.9        0.261    6.3        0.119
1.6        0.677    4.0        0.251    6.4        0.116
1.7        0.650    4.1        0.242    6.5        0.112
1.8        0.623    4.2        0.233    6.6        0.109
1.9        0.598    4.3        0.225    6.7        0.106
2.0        0.573    4.4        0.217    6.8        0.104
2.1        0.549    4.5        0.209    6.9        0.101
2.2        0.525    4.6        0.202    7.0        0.099
2.3        0.503    4.7        0.195




Also $M = 2.6797$, $m/M = 1.49$.

There is no harm in practice in rounding the factors w to two figures. 
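The tabulated weights follow directly from the definition of $w$ with the quoted constant $M$. A minimal sketch (the function name `weight` is hypothetical; the last printed digit of the table may differ by a unit in places):

```python
M = 2.6797  # constant quoted with the table for index m = 4


def weight(t, big_m=M):
    """Weight w for a standardized residual t = (x - lambda) / sigma."""
    return 1.0 / (1.0 + t * t / (2.0 * big_m))


for t in (0.0, 1.0, 2.0, 3.0, 5.0, 7.0):
    print(f"{t:3.1f}  {weight(t):.3f}")
```

Computing the weight on demand in this way avoids interpolation in the table, and rounding the result to one or two figures reproduces the practice suggested in the text.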

Chauvenet† records a set of residuals of the measured semidiameter
of Venus, in connexion with the problem of rejecting observations. 
Arranged in order of magnitude they are, in seconds of arc: 


Residual     w
-1.40       0.5
-0.44       0.9
-0.30       1.0
-0.24       1.0
-0.22       1.0
-0.13       1.0
-0.06       1.0
+0.06       1.0
+0.10       1.0
+0.18       1.0
+0.20       1.0
+0.39       0.9
+0.48       0.9
+0.63       0.8
+1.01       0.6
           ----
Sum        13.6


† Spherical and Practical Astronomy, 2, 562.




A simple calculation, allowing for the fact that two unknowns have been
estimated, gave $\sigma = 0.572''$. This suggests the set of values $w$. With
these the estimate of $\lambda$ is $+0.03''$, which we may ignore, and

$$\sum w(x-\lambda)^2 = 2.73.$$

Then a second approximation to $\sigma^2$ is

$$\sigma^2 = \frac{1.49}{13}\times 2.73 = 0.313, \qquad \sigma = 0.559''.$$

Recomputing with this value we find that the weights are unaltered to
the first decimal, and we do not need a third approximation. To find
an effective number of degrees of freedom we compute the right side
of 4.31 (11) with $n = 13$, $\sigma = 0.65$; it is 4.4, so that

$$-\frac{\partial^2}{\partial\sigma^2}\log L = \frac{4.4}{0.091} = 48; \qquad \nu' = \tfrac12\times 0.559^2\times 48 = 7.5.$$

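To get the uncertainty of $\lambda$, put $\lambda = +0.30$ in 4.31 (10); the sum on the
right side becomes $-2.78$, and

$$\frac{\partial^2}{\partial\lambda^2}\log L = -\frac{1.49\times 2.78}{0.559^2\times 0.27}.$$

Hence $s_\lambda = 0.143$
and the result is $\lambda = +0.03 \pm 0.14$, 7 d.f.,
approximately.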
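One pass of the reweighting can be sketched as follows. This is an illustration under stated assumptions, not the author's worked procedure: the weights are computed from the residuals as listed, rounded to one decimal, the location is the weighted mean, and $\sigma^2$ is taken as $(m/M)\sum w(x-\lambda)^2$ divided by 13. The names `weight` and `reweight_once` are hypothetical.

```python
import math

M = 2.6797      # constant quoted with the table of weights
M_INDEX = 4     # assumed Type VII index m

# Residuals in seconds of arc, as listed above.
residuals = [-1.40, -0.44, -0.30, -0.24, -0.22, -0.13, -0.06, 0.06,
             0.10, 0.18, 0.20, 0.39, 0.48, 0.63, 1.01]


def weight(t):
    """Weight for a standardized residual t."""
    return 1.0 / (1.0 + t * t / (2.0 * M))


def reweight_once(x, lam, sigma, dof):
    """One reweighting step: weights to one decimal, then new lambda and sigma."""
    w = [round(weight((xi - lam) / sigma), 1) for xi in x]
    new_lam = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)
    ss = sum(wi * (xi - new_lam) ** 2 for wi, xi in zip(w, x))
    new_sigma = math.sqrt((M_INDEX / M) * ss / dof)
    return w, new_lam, new_sigma


n = len(residuals)
dof = n - 2                      # two unknowns estimated
mean = sum(residuals) / n
sigma0 = math.sqrt(sum((xi - mean) ** 2 for xi in residuals) / dof)

w, lam, sigma = reweight_once(residuals, 0.0, sigma0, dof)
print(f"first sigma = {sigma0:.3f}, sum of weights = {sum(w):.1f}")
print(f"lambda = {lam:+.2f}, second sigma = {sigma:.3f}")
```

The two extreme residuals receive weights of about one half and are retained; nothing is rejected outright.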

Chauvenet's criterion led him to the rejection of the two extreme
observations and to $\sigma = 0.339$. The resulting standard error of the
mean would be 0.094. But with $\sigma = 0.56$ there are 3 residuals out of
15 greater than $\sigma$, 1 greater than $2\sigma$. This is not unreasonable either
for index 4 or for the normal law. If we reject the two extreme observa-
tions and use $\sigma = 0.34''$, there are 4 out of 13 greater than $\sigma$, none
greater than $2\sigma$. This would not be unreasonable for the normal law.
The distribution by itself provides little evidence to decide whether
Chauvenet's method of rejection or the present method is more appro-
priate. I should say, however, from comparison with other series, that
there would be a stronger case for the present method, so long as there
is no reason, recorded at the time of observing, for mistrusting particular
observations. Even if the extreme observations are rightly rejected,
the estimate of $\sigma$ is based on 11 degrees of freedom, and from Fisher's
$z$ table there is a 5 per cent. chance of $\sigma$ being 1.5 or more times the
estimate. This is increased if observations are rejected.




According to the rough method based on the median, which is independent
of the law of error, the median would be the eighth observation,
+0.06, and limits corresponding to its standard error would be
$(15/4)^{1/2} = 1.9$ observations away. Interpolated, this puts the limits
at -0.12 and +0.17, so that the median of the law can be put at
+0.03 ± 0.145. This standard error happens to agree closely with that
found for index 4.
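The median rule is easy to mechanize. The sketch below assumes an odd number of ordered observations, places the limits $\sqrt{n/4}$ positions on either side of the middle one, and interpolates linearly between neighbouring ordered values; the function name is hypothetical. Because it does not round the offset to 1.9 before interpolating, its limits differ slightly from those quoted.

```python
import math

# The 15 ordered residuals of the example, in seconds of arc.
ordered = [-1.40, -0.44, -0.30, -0.24, -0.22, -0.13, -0.06, 0.06,
           0.10, 0.18, 0.20, 0.39, 0.48, 0.63, 1.01]


def median_with_limits(sorted_x):
    """Median and limits sqrt(n/4) ordered positions either side (odd n assumed)."""
    n = len(sorted_x)
    centre = (n + 1) // 2                 # 1-based position of the median
    offset = math.sqrt(n / 4.0)

    def value_at(position):
        # Linear interpolation at a fractional 1-based position.
        lower = int(math.floor(position))
        frac = position - lower
        if frac == 0.0:
            return sorted_x[lower - 1]
        return sorted_x[lower - 1] + frac * (sorted_x[lower] - sorted_x[lower - 1])

    return sorted_x[centre - 1], value_at(centre - offset), value_at(centre + offset)


median, low, high = median_with_limits(ordered)
print(f"median = {median:+.2f}, limits {low:+.2f} and {high:+.2f}")
print(f"summary: {(low + high) / 2:+.2f} +/- {(high - low) / 2:.3f}")
```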

The table of weights on p. 291 should be of use in a number of 
problems where there is at present no alternative to either keeping all 
the observations at full weight or rejecting some entirely. The fact that 
an error in $m$ produces to the first order no error in either $\lambda$ or $\sigma$ ensures
that even if $m$ is not 4 the hypothesis that it is will not give any serious
errors. The importance of a very large residual is much reduced, but
the slow variation of the weight with the size of the residual prevents
the large shifts of the mean that may depend on what observations are 
rejected. 


5.8. Test for independence in rare events. Here the null hypo- 
thesis is that the chance of the number of events in an interval of 
observation follows the Poisson rule. Two types of departure from the 
conditions for this rule have been considered, and both have led to the 
negative binomial rule. Both are somewhat artificial. On the other hand, 
any variation of the Poisson parameter, or any tendency of the events 
to occur in groups instead of independently, will tend to spread the 
law and make it more like the negative binomial. Among the various 
possibilities it has the great advantage that it can be definitely stated 
and involves just one new parameter. (Two simple Poisson rules super- 
posed would involve three in all, the two for the separate laws and one 
for the fraction of the chance contained in one of them; and thus two 
new parameters.) If the data support it against the Poisson law, the 
latter is at any rate shown to be inadequate, and we can proceed to 
consider whether the negative binomial itself is satisfactory. 

The Poisson law is the limit of the negative binomial when $n \to \infty$.
There is a sufficient statistic for the parameter $r'$, if the law is taken in
the form we chose in 2.4 (13), but not for $n$. In a significance test, however,
we are chiefly concerned with small values of the new parameter,
which we can take to be $1/n = \nu$.

The law is 


$$P(m \mid r', n, H) = \left(\frac{n}{n+r'}\right)^{n}\frac{n(n+1)\cdots(n+m-1)}{m!}\left(\frac{r'}{n+r'}\right)^{m}. \tag{1}$$





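Suppose that in a series of trials the value $m_k$ occurs $n_k$ times. Then

$$L = \left(\frac{n}{n+r'}\right)^{n\sum n_k}\prod_k\left\{\frac{n(n+1)\cdots(n+m_k-1)}{m_k!}\right\}^{n_k}\left(\frac{r'}{n+r'}\right)^{\sum n_k m_k}, \tag{2}$$

$$\frac{1}{L}\frac{\partial L}{\partial r'} = -\frac{n\sum n_k}{n+r'} + \sum n_k m_k\left(\frac{1}{r'}-\frac{1}{n+r'}\right) = \frac{n\left(\sum n_k m_k - r'\sum n_k\right)}{r'(n+r')}. \tag{3}$$

Hence the maximum likelihood solution for $r'$ is

$$r' = \frac{\sum n_k m_k}{\sum n_k}. \tag{4}$$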

Thus the mean number of occurrences is a sufficient statistic for $r'$,
irrespective of $n$; we have already had this result in the extreme case
of the Poisson law. The uncertainty of $r'$, however, does depend on $n$.
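The form of the law and its limiting behaviour can be checked numerically. The sketch below evaluates the negative binomial chances through log-gamma functions, which is an implementation choice made here for non-integer $n$, and compares them with the Poisson law for very large $n$. The frequency table `counts` is made up purely for illustration, and the function names are hypothetical.

```python
import math


def neg_binomial_pmf(m, r, n):
    """Chance of m events under the negative binomial law with mean r, index n."""
    log_p = (n * math.log(n / (n + r))
             + math.lgamma(n + m) - math.lgamma(n) - math.lgamma(m + 1)
             + m * math.log(r / (n + r)))
    return math.exp(log_p)


def poisson_pmf(m, r):
    """Chance of m events under the Poisson law with mean r."""
    return math.exp(-r + m * math.log(r) - math.lgamma(m + 1))


def ml_mean(counts):
    """Maximum likelihood r': the mean number of occurrences.

    counts maps each observed value m_k to the number of trials n_k giving it.
    """
    total = sum(counts.values())
    return sum(m * c for m, c in counts.items()) / total


counts = {0: 10, 1: 6, 2: 3, 3: 1}      # hypothetical frequencies
r_hat = ml_mean(counts)
print("estimated r':", r_hat)

total_chance = sum(neg_binomial_pmf(m, 0.7, 5.0) for m in range(200))
mean_count = sum(m * neg_binomial_pmf(m, 0.7, 5.0) for m in range(200))
print(total_chance, mean_count)
print(neg_binomial_pmf(2, 0.7, 1e6), poisson_pmf(2, 0.7))
```

Whatever the value of $n$, the mean of the law is $r'$, which is why the sample mean alone fixes the estimate of $r'$.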

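Now form $J$ for the comparison of the above negative binomial law
with the Poisson law

$$P(m \mid r, H) = e^{-r}\frac{r^m}{m!}. \tag{5}$$

If $n$ is large we find

$$\log\frac{P(m \mid r', n, H)}{P(m \mid r, H)} = -n\log\left(1+\frac{r'}{n}\right) + r + m\log\frac{r'}{r} - m\log(n+r') + \sum_{s=0}^{m-1}\log(n+s)$$

$$= -n\log\left(1+\frac{r'}{n}\right) + r + m\left\{\log\frac{r'}{r} - \log\frac{n+r'}{n}\right\} + \frac{m(m-1)}{2n}, \tag{6}$$

$$J = (r'-r)\left\{\log\frac{r'}{r} - \log\frac{n+r'}{n}\right\} + \frac{(1+1/n)\,r'^2 - r^2}{2n}$$

$$= \frac{(r'-r)^2}{r} + \frac12\frac{r^2}{n^2}. \tag{7}$$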

Hence for large $n$, $r'$ and $\nu$ are orthogonal parameters. This is another
advantage of the form we have chosen for the negative binomial law.
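The large-$n$ form of $J$ can be compared with a direct summation. The sketch below assumes $J$ is the sum over $m$ of the difference of the two chances multiplied by the logarithm of their ratio, and checks the term $\tfrac12 r^2/n^2$ in the case where the two laws have the same mean; the function names and the truncation point `m_max` are choices made for this illustration.

```python
import math


def log_neg_binomial(m, r, n):
    """Log chance of m under the negative binomial law with mean r, index n."""
    return (n * math.log(n / (n + r))
            + math.lgamma(n + m) - math.lgamma(n) - math.lgamma(m + 1)
            + m * math.log(r / (n + r)))


def log_poisson(m, r):
    """Log chance of m under the Poisson law with mean r."""
    return -r + m * math.log(r) - math.lgamma(m + 1)


def j_exact(r_nb, n, r_pois, m_max=400):
    """J between the two laws by direct summation over m = 0 .. m_max."""
    total = 0.0
    for m in range(m_max + 1):
        lp_nb = log_neg_binomial(m, r_nb, n)
        lp_po = log_poisson(m, r_pois)
        total += (math.exp(lp_nb) - math.exp(lp_po)) * (lp_nb - lp_po)
    return total


def j_large_n(r_nb, n, r_pois):
    """Approximation (r' - r)^2 / r + r^2 / (2 n^2) for large n."""
    return (r_nb - r_pois) ** 2 / r_pois + 0.5 * r_pois ** 2 / n ** 2


exact = j_exact(0.7, 200.0, 0.7)
approx = j_large_n(0.7, 200.0, 0.7)
print(exact, approx, exact / approx)
```

With equal means the first term vanishes and only the part depending on $\nu = 1/n$ remains, which is the sense in which the two parameters separate.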
As $\nu \geq 0$ the approximate form of $K$ given in 5.31 should be adapted

to 



(8) 


where $N$ is the number of trials and estimated values are substituted
for $\nu$ and $s_\nu$. This form is to be used if $\nu > s_\nu$; if $|\nu| < s_\nu$, the outside
factor is larger, tending to … when $\nu = 0$. If $\nu$ is small $s_\nu$ should
be nearly …

For the data on the numbers of men killed by the kick of a horse
(p. 69) we find $N = 280$, $r = 0.700$,
and solving for $\nu$ by minimum $\chi'^2$, taking $r$ as given, we get

$$\nu = +0.053 \pm 0.074,$$

$$K = 10\exp(-0.26) = 8.$$

The solution is rough; $s_\nu$ as given by (9) would be about 0.12, the
difference being due to the fact that the posterior probability distribu-
tion of $\nu$ is far from normal. But in any case there is no doubt that the
data confirm the Poisson rule and more detailed examination is un-
necessary.

For the radioactivity data we have similarly

$$N = 2608, \quad r = 3.87, \quad \nu = -0.0866 \pm 0.0951,$$

the calculated standard error being 0.072. Then, since the estimate of
$\nu$ is negative, we use 2 instead of 8 in (8), and

$$K > 60.$$

The Poisson law is strongly confirmed.

In studies of factory accidents made by Miss E. M. Newbold,† strong
departures from the Poisson rule were found, and there was a fairly
good fit with the negative binomial. Two of Newbold's series, fitted by
minimum $\chi'^2$, would correspond in the present notation to‡

$$r = 0.835 \pm 0.058, \quad n = 0.99 \pm 0.17; \quad N = 447;$$

$$r = 3.91 \pm 0.21, \quad n = 1.54 \pm 0.20; \quad N = 376.$$

In these cases $\nu$ is several times its standard error and its posterior
probability distribution should be nearly normal. Significance is obvious
without calculation. But the first series gives more individuals with large
numbers of accidents than the negative binomial would predict, and it
seems that this law, though much better than Poisson's, is not altogether
satisfactory for this series. Actually the mean number of occurrences
was 0.978, which differs substantially from $r$ as found by minimum $\chi'^2$,
although the mean is a sufficient statistic.

5.9. Introduction of new functions. Suppose that a set of observa- 
tions of a quantity y are made for different values of a variable x. 
† J. R. Stat. Soc. 90, 1927, 487-547. ‡ Ann. Eugen. 11, 1941, 108-14.






According to the null hypothesis q, the probability of y follows the same 
law for all values of x. According to q' the laws for y are displaced by 
a location parameter depending on $x$, for instance, a linear function of
$x$ or a harmonic function $\alpha \sin kx$. This displacement is supposed speci-
fied except for an adjustable coefficient $\alpha$. We have now a complication,
since the values of x may be arbitrarily chosen, and J will differ for 
different x even if the coefficient is the same. We therefore need to 
summarize the values of J into a single one. 

In problems of this type the probability distribution of $x$ may be
regarded as fixed independently of the new parameter; the values of $x$
may arise from some law that does not contain $y$, or they may be chosen
deliberately by the experimenter. In the latter case the previous
information $H$ must be regarded as including the information that just
those values of $x$ will occur. Now suppose that the chance of a value
$x_r$ in an interval $\delta x_r$ is $p_r$, and that that of $y$, given $x_r$, is $f(x_r, \alpha, y_r)\,\delta y_r$.
Then for a general observation

$$P(\delta x_r, \delta y_r) = p_r f(x_r, \alpha, y_r)\,\delta y_r \tag{1}$$

and for the whole series

$$J = \sum_r \sum p_r \log\frac{f(x_r, \alpha+\Delta\alpha, y_r)}{f(x_r, \alpha, y_r)}\,\{f(x_r, \alpha+\Delta\alpha, y_r) - f(x_r, \alpha, y_r)\}\,\delta y_r = \sum_r p_r J_r, \tag{2}$$

where $J_r$ is derived from the comparison of the laws for $y_r$ given $x_r$.

In particular consider normal correlation, stated in terms of the 
regression of $y$ on $x$. Applying 3.9 (15) to 2.5 (9) for given $x$, $\sigma$, $\tau$ we find





{pV- 


- J 

„ 1 l-2pp'rlr’+T^lT'^ 1 l-2pp'T7T+r'*/T» 

'^2 1— p'2 '^2 l-p* 




( 3 ) 


This is the case of 3.9 (38) when $\sigma' = \sigma$.

If all of a discrete set of values of $x$ have an equal chance of occurring,
it follows from (2) that $J$ is the mean of the $J_r$. The extension to the
case where the chance of x is uniformly distributed over an interval is 
immediate. 

Now if there are $n$ values $x_r$, each equally likely to occur, and we make
$nm$ observations, we shall expect that about $m$ observations will be
made of each value. It seems appropriate, in a case where all the $x_r$
are fixed in advance, again to take the mean of the $J_r$. For if we form
$J$ for the whole of the observed values of $x_r$, it will be $\sum J_r$; if we take
$m$ observations for each value it will be $m\sum J_r$. If our results are to
correspond as closely as possible to the case where about $m$ observations
for each $x_r$ are expected to arise by chance we should therefore divide
the latter sum by $nm$.

Alternatively we may argue that if the number of observed values
of $x_r$ is large and we take them in a random order, there is an equal
chance of any particular $x_r$ occurring in a given place in the order, and
these chances are nearly independent. We then apply (2) directly.

The distinction is that in the first case we average J over the values 
of X that might occur; in the second we average it over the values of x 
that have actually occurred. The point, stated in other ways, has arisen 
in several previous discussions, and it appears that each choice is right 
in its proper place. In studying the variation of rainfall with latitude 
and longitude, for instance, we might proceed in three ways, (a) We 
might choose the latitudes and longitudes of the places for observation 
by means of a set of random numbers, and instal special rain-gauges 
at the places indicated. Since any place in the area could be chosen in 
this way, it is correct to take the average of $J$ over the region. (b) We
might deliberately set out the rain-gauges at equal intervals of latitude 
and longitude so as to cover the region. In this case we should take 
the mean of the values of J for the stations, but if the interval is small 
compared with the length and breadth of the region it will differ little 
from the mean over the whole region, (c) We might simply use the 
existing rain-gauges. Again we should take the mean of J for the 
stations. Its actual value, for given $\alpha$, will differ from that in (b). The
stations might, for instance, all be in the southern half of the region. 
But we should consider the situation existing when such a method is 
adopted. There is no observational information for the northern half; 
there is a serious suggestion that the question can be settled from the 
southern half alone. In (a) and (b) the suggestion is that the effect is
likely to be large enough to be detected from data over the whole region, 
but not likely to be detected from data for half of it. In fact the choice 
of design depends on the previous information and the difference in 
the value chosen for $J$, as a function of $\alpha$, expresses the same previous
information. In testing the significance of a measured parallax of a 
star, for instance, we can and must take into account the fact that we 





are observing from the Earth, not from a hypothetical planet associated
with that star or from one in a remote nebula. 

In physical subjects methods analogous to (b) and (c) will usually 
be adopted, (a) is used in some investigations relating to population 
statistics. It has the advantage over (c) that it randomizes systematic 
disturbances other than those directly considered. For instance, actual 
rain-gauges tend to be placed at low levels, whereas (a) and (b) would 
give high stations chances of being selected in accordance with the area 
of high land. In some problems (b) would suffer from a similar
disadvantage to (a), though hardly in the present one (cf. also 4.9).

In what follows we shall follow the rule of (b) and (c) and take the 
summary value of J for given a to be the mean of the values for the 
observed values of the (one or more) independent variables. 

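5.91. Suppose now that on $q$ the measure of a variable $x_r$ for given
$t_r$ follows a rule

$$P(dx_r \mid q, \sigma, H) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{x_r^2}{2\sigma^2}\right)dx_r, \tag{1}$$

and that on $q'$

$$P(dx_r \mid q', \alpha, \sigma, H) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{\{x_r - \alpha f(t_r)\}^2}{2\sigma^2}\right]dx_r. \tag{2}$$

Then

$$J_r = \frac{\alpha^2 f^2(t_r)}{\sigma^2}, \tag{3}$$

$$J = \frac{\alpha^2\,\overline{f^2(t_r)}}{\sigma^2}, \tag{4}$$

where the bar indicates a mean over the observed $t_r$. Now in forming
the likelihood for $n$ observations we obtain the exponent

$$-\frac{1}{2\sigma^2}\sum\{x_r - \alpha f(t_r)\}^2. \tag{5}$$

Let $a$ be the value of $\alpha$ that makes this stationary. Evidently

$$a = \frac{\sum x_r f(t_r)}{\sum f^2(t_r)} \tag{6}$$

and (5) becomes

$$-\frac{1}{2\sigma^2}\left[\sum f^2(t_r)\,(\alpha - a)^2 + \sum\{x_r - a f(t_r)\}^2\right]. \tag{7}$$

The forms of (4) and (7) are exactly the same as in the test for whether
a single true value agrees with zero; we have only to take this true
value as being $\alpha\{\overline{f^2(t_r)}\}^{1/2}$. Its estimate is $a\sqrt{\overline{f^2(t_r)}}$, and the second sum
in (7) is the sum of the squares of the residuals. Consequently the whole
of the tests related to the normal law of error can be adapted imme-
diately to tests concerning the introduction of a new function to
represent a series of measures.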
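The splitting of the exponent into a part depending on $\alpha$ and a sum of squared residuals can be verified numerically. In the sketch below the measures `x`, the arguments `t` and the trial function $\sin(0.5t)$ are all made up for illustration; only the algebraic identity is being checked, and the function names are hypothetical.

```python
import math

# Hypothetical arguments t_r, trial function f(t_r) and measures x_r.
t = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
f = [math.sin(0.5 * ti) for ti in t]
x = [0.31, 0.52, 0.61, 0.48, 0.40, 0.02, -0.15, -0.44]


def fit_coefficient(x, f):
    """Least-squares coefficient a, sum of f^2, and residual sum of squares."""
    sff = sum(fi * fi for fi in f)
    a = sum(xi * fi for xi, fi in zip(x, f)) / sff
    rss = sum((xi - a * fi) ** 2 for xi, fi in zip(x, f))
    return a, sff, rss


def sum_of_squares(x, f, alpha):
    """Sum of {x_r - alpha f(t_r)}^2 for a trial value of alpha."""
    return sum((xi - alpha * fi) ** 2 for xi, fi in zip(x, f))


a, sff, rss = fit_coefficient(x, f)
alpha = 0.25                                   # arbitrary trial value
lhs = sum_of_squares(x, f, alpha)
rhs = sff * (alpha - a) ** 2 + rss
print(a, lhs, rhs)
```

Because the cross term vanishes at the least-squares value $a$, the two sides agree for any trial $\alpha$, which is what allows the single-parameter tests to be carried over unchanged.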





5.92. Allowance for old functions. In most actual cases we have 
not simply to analyse a variation of measures in terms of random error 
and one new function. Usually it is already known that other functions 
with adjustable coefficients are relevant, even an additive constant 
being an example. These coefficients must themselves be found from 
the observations. We suppose that they are already known with suffi- 
cient accuracy for the effects of further changes to be linear, and that 
small changes in them make changes $\alpha_s g_s(t_r)$ ($s = 1$ to $m$). The new
function $f(t)$ must not be linearly expressible in terms of the $g_s(t)$; for
if it was, any change made by it could be equally well expressed by
changes of the $\alpha_s$. We can then suppose $f(t)$ adjusted to be orthogonal
with the $g_s(t)$ by subtracting a suitable linear combination of the $g_s(t)$.
Then the problem with regard to $\alpha_s$ ($s = 1$ to $m$) is one of pure estima-
tion and a factor $\prod d\alpha_s$ must appear in the prior probabilities. Inte-
gration with regard to this will bring in factors $(2\pi\sigma^2)^{m/2}$ in the posterior
probabilities on both $q$ and $q'$, and the integration with regard to $\sigma$ will
replace the index $-\tfrac12 n + 1$ in 5.2 (22) by $-\tfrac12(n-m)+1 = -\tfrac12(\nu-1)$ as
before. But the $n$ in the outside factor arises from the integration with
respect to the new parameter and is unaltered. Hence the asymptotic
formula corresponding to 5.2 (22) is

$$K \approx \left(\frac{\pi n}{2}\right)^{1/2}\left(1+\frac{t^2}{\nu}\right)^{-\frac12\nu+\frac12}.$$

As a rule $n$ will be large compared with $m$ and there will be little loss
of accuracy in replacing $n$ by $\nu$ in the outside factor too.

As an example, consider the times of the P wave in seismology up
to a distance of 20° as seen from the centre of the earth. The observa-
tions were very unevenly distributed with regard to distance; theoretical
considerations showed that the expansion of the time in powers of the
distance $\Delta$ should contain a constant and terms in $\Delta$ and $(\Delta - 1°)^3$, but
no term in $(\Delta - 1°)^2$. The question was whether the observations
supported a term in $(\Delta - 1°)^4$. A function $F_4$, given by
^4 = ig^(A-l)«-a- 6 A-c(A-l)®, 


$a$, $b$, and $c$ being so chosen that $F_4$ should be orthogonal with a constant,
$\Delta$, and $(\Delta - 1°)^3$ at the weights, was constructed. A least-squares
solution based on about 384 observations gave the coefficient of $F_4$, in
seconds, as $-0.926 \pm 0.690$. Here $n = 384$ and the index is large enough
for the exponential approximation to be used; we have then

$$K = \left(\frac{384\pi}{2}\right)^{1/2}\exp\left(-\frac{0.926^2}{2\times 0.690^2}\right) = 24.6\exp(-0.9005) = 10.0.$$





The odds are therefore about 10 to 1 that the fourth power is not needed 
at these distances and that we should probably lose accuracy if we 
introduced it. (There is a change in the character of the solution about 
20° that makes any polynomial approximation useless in ranges includ-
ing that distance; hence the restriction of the solution to observations 
within 20°. ) Here we have also an illustration of the principle of 1.61. 
There was no reason to suppose a cubic form final, its only justification 
being that it corresponds closely to the consequences of having one or 
more thin surface layers, each nearly uniform, resting on a region where 
the velocity increases linearly with depth. The structure of the upper 
layers led to the introduction of A — 1° in place of A, and to the constant 
term in the time. The success of one form with three adjustable con- 
stants was then enough to show, first, that it was not in any case 
permissible on the data to introduce four, and hence that any other 
permissible formula must be one with three constants; second, that such 
a form, if it was to be valid, must give times agreeing closely with those 
given by the cubic. 
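The arithmetic of the seismological example can be repeated in a few lines. The sketch uses the exponential approximation $K \approx (\pi n/2)^{1/2}\exp(-a^2/2s^2)$, with $a$ the estimated coefficient and $s$ its standard error; the function name `k_large_sample` is hypothetical.

```python
import math


def k_large_sample(n, estimate, std_err):
    """Large-sample K for one new parameter: sqrt(pi n / 2) exp(-a^2 / (2 s^2))."""
    outside = math.sqrt(math.pi * n / 2.0)
    return outside * math.exp(-estimate ** 2 / (2.0 * std_err ** 2))


outside_factor = math.sqrt(math.pi * 384 / 2.0)
k = k_large_sample(384, -0.926, 0.690)
print(f"outside factor = {outside_factor:.1f}, K = {k:.1f}")
```

An estimate 1.34 times its standard error would often be called suggestive, yet with 384 observations the outside factor is large enough for the odds still to favour the simpler form by about 10 to 1.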

5.93. Two sets of observations relevant to the same parameter. 
It often happens that two measurable quantities x, y are related on q' 
in such a way that 

$$x = \alpha f(t) \pm \sigma, \qquad y = k\alpha g(t) \pm \tau, \tag{1}$$

where $f(t)$, $g(t)$ are known functions whose mean squares over the ob-
served values are 1, and $k$ is a known constant. For instance, in the
measurement of the parallax of a star $x$, $y$ may be the apparent
disturbances in right ascension and declination, the theoretical values
of which are the product of the unknown parallax into two known
functions of the time. The mean square values of these functions need
not be equal; hence if we use the form (1) the constant $k$ will be needed.
We take $\sigma$, $\tau$ as known. Then


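$$J = \frac{\alpha^2}{\sigma^2} + \frac{k^2\alpha^2}{\tau^2}, \tag{2}$$

$$P(d\alpha \mid q'H) = \frac{1}{\pi}\frac{A\,d\alpha}{1+A^2\alpha^2}, \tag{3}$$

where

$$A^2 = \frac{1}{\sigma^2} + \frac{k^2}{\tau^2}. \tag{4}$$

Let $a$, $b$ be the maximum likelihood estimates of $\alpha$ from the observa-
tions of $x$ and $y$ separately, $s'^2$ and $t'^2$ the mean square residuals; then

$$P(q \mid \theta H) \propto \exp\left(-\frac{ns'^2}{2\sigma^2} - \frac{nt'^2}{2\tau^2} - \frac{na^2}{2\sigma^2} - \frac{nk^2b^2}{2\tau^2}\right), \tag{5}$$

$$P(q'\,d\alpha \mid \theta H) \propto \exp\left\{-\frac{ns'^2}{2\sigma^2} - \frac{nt'^2}{2\tau^2} - \frac{n(a-\alpha)^2}{2\sigma^2} - \frac{nk^2(b-\alpha)^2}{2\tau^2}\right\}\frac{1}{\pi}\frac{A\,d\alpha}{1+A^2\alpha^2}. \tag{6}$$

The maximum of the exponent is at

$$\alpha = \frac{a/\sigma^2 + k^2 b/\tau^2}{1/\sigma^2 + k^2/\tau^2} \tag{7}$$

and, approximately,

$$P(q' \mid \theta H) \propto \sqrt{\frac{2}{\pi n}}\exp\left\{-\frac{ns'^2}{2\sigma^2} - \frac{nt'^2}{2\tau^2} - \frac{nk^2(a-b)^2}{2(\tau^2+k^2\sigma^2)}\right\}, \tag{8}$$

$$K \approx \sqrt{\frac{\pi n}{2}}\exp\left\{-\frac{na^2}{2\sigma^2} - \frac{nk^2b^2}{2\tau^2} + \frac{nk^2(a-b)^2}{2(\tau^2+k^2\sigma^2)}\right\}$$

$$= \sqrt{\frac{\pi n}{2}}\exp\left\{-\frac{n(a\tau^2+k^2b\sigma^2)^2}{2\sigma^2\tau^2(\tau^2+k^2\sigma^2)}\right\}$$

$$= \sqrt{\frac{\pi n}{2}}\exp\left\{-\frac{(a/s_a^2+b/s_b^2)^2}{2(1/s_a^2+1/s_b^2)}\right\}, \tag{9}$$

where $s_a$ and $s_b$ are the separate standard errors.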

When $k$ is large or small the exponent reduces to $-\tfrac12 nk^2b^2/\tau^2$ or
$-\tfrac12 na^2/\sigma^2$, as we should expect. For intermediate values of $k$, $K$ may
differ considerably according as the two estimates $a$, $b$ have the same
sign or opposite signs, again as we should expect.
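The three equivalent forms of the exponent of $K$, and the dependence on the signs of the two estimates, can be checked numerically. The sketch below takes the separate standard errors as $s_a^2 = \sigma^2/n$ and $s_b^2 = \tau^2/(nk^2)$, which follows from the unit mean squares of $f$ and $g$; all numerical inputs are made up for illustration and the function names are hypothetical.

```python
import math


def exponent_forms(a, b, n, k, sigma, tau):
    """The exponent of K written in three algebraically equivalent ways."""
    e1 = (-n * a ** 2 / (2 * sigma ** 2) - n * k ** 2 * b ** 2 / (2 * tau ** 2)
          + n * k ** 2 * (a - b) ** 2 / (2 * (tau ** 2 + k ** 2 * sigma ** 2)))
    e2 = (-n * (a * tau ** 2 + k ** 2 * b * sigma ** 2) ** 2
          / (2 * sigma ** 2 * tau ** 2 * (tau ** 2 + k ** 2 * sigma ** 2)))
    sa2 = sigma ** 2 / n                  # squared standard error of a
    sb2 = tau ** 2 / (n * k ** 2)         # squared standard error of b
    e3 = -(a / sa2 + b / sb2) ** 2 / (2 * (1 / sa2 + 1 / sb2))
    return e1, e2, e3


def k_two_series(a, b, n, k, sigma, tau):
    """Approximate K from two sets of observations bearing on one parameter."""
    return math.sqrt(math.pi * n / 2.0) * math.exp(exponent_forms(a, b, n, k, sigma, tau)[2])


# Hypothetical values: n observations of each kind.
n, k, sigma, tau = 50, 0.8, 0.05, 0.06
e1, e2, e3 = exponent_forms(0.020, 0.015, n, k, sigma, tau)
k_same = k_two_series(0.020, 0.015, n, k, sigma, tau)
k_opposite = k_two_series(0.020, -0.015, n, k, sigma, tau)
print(e1, e2, e3)
print(k_same, k_opposite)
```

With these inputs, estimates of the same sign reinforce each other and give a small $K$, while estimates of opposite sign largely cancel and leave $K$ above 1.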

5.94. Continuous departure from a uniform distribution of 
chance. The chance of an event may be distributed continuously, often 
uniformly, over the range of a measured argument. The question may 
then be whether this chance departs from the distribution suggested. 
Thus it may be asked whether the variation of the numbers of earthquakes
from year to year shows any evidence for an increase, or whether
from day to day after a large main shock it departs from some simple 
law of chance. We consider here problems where the trial hypothesis 
q is that the distribution is uniform. We can then choose a linear 
function t of the argument x, so that t will be 0 at the lower and 1 at 
the upper limit. The chance of an event in a range dx is then dt, and 
that of $n$ observations in specified ranges is $\prod(dt)$, provided that they
are independent. 

The alternative $q'$ needs some care in statement. It is natural to
suppose that the chance of an event in an interval $dt$ is

$$P(dt \mid q'\alpha H) = \{1+\alpha f(t)\}\,dt, \tag{1}$$

where $f(t)$ is a given function and $\int_0^1 f(t)\,dt = 0$. This is satisfactory when
$\alpha$ is small, but if $\alpha$ is large it no longer seems reasonable to take the
disturbance for each $t$ as proportional to the same constant. Consider
a circular tray, on which marbles are dropped, while it is agitated
in its own plane. If the tray is horizontal the chance of a marble coming
off is uniformly distributed with regard to the azimuth $\theta$. If it is slightly
tilted in the direction $\theta = 0$, the chance will approximate to the above
form with $f(t) = \cos\theta$. But with a larger tilt nearly the whole of the
marbles will come off on the lower side, so that the chance on the upper
side approximates to 0 and its distribution deviates completely from (1),
with $f(t) = \cos\theta$, for any value of $\alpha$; if we took $\alpha > 1$ we should get
negative chances, and with any $\alpha \le 1$ the chance of values of $\theta$ between
$\tfrac12\pi$ and $\tfrac32\pi$ would not be small. With still greater slopes nearly all the
marbles would come off near the lowest point. Thus with an external
force accurately proportional to $\cos\theta$, for any given slope, the resulting
chance distribution may vary from a uniform one to one closely
concentrated about a single value of $\theta$ in a way that cannot be represented
even roughly by any function of the form (1).

If, however, we take in this case

$$P(d\theta \mid q'\alpha H) = A\exp(\alpha\cos\theta)\,d\theta, \tag{2}$$

where

$$A\int_0^{2\pi}\exp(\alpha\cos\theta)\,d\theta = 1, \tag{3}$$

the conditions of the problem are satisfied. Negative chances are excluded,
and with sufficiently large $\alpha$ the chance can be arbitrarily closely
concentrated about $\theta = 0$. Hence instead of (1) it seems reasonable to
take

$$P(dt \mid q'\alpha H) = \exp\{\alpha f(t)\}\,dt \Big/ \int_0^1 \exp\{\alpha f(t)\}\,dt, \tag{4}$$

where $\alpha$ may have any finite value.
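
The contrast between the two forms can be made concrete for the tilted-tray case $f = \cos\theta$. The sketch below is illustrative only: the function names are hypothetical, the linear form is taken to be the chance $\{1+\alpha f(t)\}\,dt$ expressed per unit of $\theta$, and the normalising constant $A$ of (3) is found by a simple midpoint rule rather than in closed form.

```python
import math

def linear_density(theta, alpha):
    """Linear form on the circle: (1 + alpha*cos(theta)) / (2*pi)."""
    return (1.0 + alpha * math.cos(theta)) / (2.0 * math.pi)

def exp_density(theta, alpha, steps=20000):
    """Exponential form A*exp(alpha*cos(theta)), with A fixed numerically."""
    h = 2.0 * math.pi / steps
    # Midpoint rule for the normalising integral over one full turn.
    norm = sum(math.exp(alpha * math.cos((i + 0.5) * h))
               for i in range(steps)) * h
    return math.exp(alpha * math.cos(theta)) / norm

for alpha in (0.01, 2.0, 20.0):
    # Density on the upper side (theta = pi) under each form.
    print(alpha, linear_density(math.pi, alpha), exp_density(math.pi, alpha))
```

For small $\alpha$ the two agree closely; for $\alpha > 1$ the linear form goes negative on the upper side while the exponential form stays positive and simply becomes very small there.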

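Comparing with the null hypothesis $\alpha = 0$ we see that $\alpha$ can range
from $-\infty$ to $\infty$, and for small $\alpha$

$$\int_0^1 \exp\{\alpha f(t)\}\,dt = 1 + O(\alpha^2), \tag{5}$$

$$J = \int_0^1 \alpha f(t)\{\exp \alpha f(t) - 1\}\,dt \doteq \alpha^2\int_0^1 f^2(t)\,dt. \tag{6}$$

Without loss of generality we can take

$$\int_0^1 f^2(t)\,dt = 1. \tag{7}$$

Then

$$\int_0^1 \exp\{\alpha f(t)\}\,dt = (1+\tfrac12\alpha^2) = \exp\tfrac12\alpha^2, \tag{8}$$

$$P(q \mid H) = \tfrac12, \tag{9}$$

$$P(d\alpha \mid q'H) = \frac{1}{\pi}\,\frac{d\alpha}{1+\alpha^2} \tag{10}$$

for small $\alpha$.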

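Let $n$ observations occur in the intervals $dt_r$. Then over the range
where the integrand is appreciable

$$P(\theta \mid qH) = \prod (dt_r), \tag{11}$$

$$P(\theta \mid q'\alpha H) \doteq \exp\Big\{\alpha\sum f(t_r) - \tfrac12 n\alpha^2\Big\}\prod (dt_r), \tag{12}$$

$$\frac{1}{K} \doteq \frac{1}{\pi}\int_{-\infty}^{\infty}\exp\Big\{\alpha\sum f(t_r) - \tfrac12 n\alpha^2\Big\}\frac{d\alpha}{1+\alpha^2}
\doteq \left(\frac{2}{\pi n}\right)^{1/2}\exp\left[\frac{\{\sum f(t_r)\}^2}{2n}\right]\frac{1}{1+\{\sum f(t_r)/n\}^2}. \tag{13}$$

This is valid if $\sum f(t_r)/\sqrt{n}$ is not large; but then $\sum f(t_r)/n$ will be small
and the last factor will approximate to 1. Hence

$$K \doteq \left(\frac{\pi n}{2}\right)^{1/2}\exp\left[-\frac{\{\sum f(t_r)\}^2}{2n}\right], \tag{14}$$

provided the estimate of $\alpha$, namely $\frac{1}{n}\sum f(t_r)$, is small.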

n 

The solution in the first edition used (1) and contained a factor $c$
representing the range of $\alpha$ permitted by the condition that a chance
cannot be negative. This complication is rendered unnecessary by the
modification (4).
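
As a rough check on the approximation for $K$, one can compare the closed form $(\pi n/2)^{1/2}\exp[-\{\sum f(t_r)\}^2/2n]$ with a direct numerical evaluation of the integral over $\alpha$. This is a sketch under stated assumptions: the integrand is taken as $\exp\{\alpha\sum f(t_r) - \tfrac12 n\alpha^2\}/(1+\alpha^2)$ with the factor $1/\pi$ outside, the function names are hypothetical, and the data are invented values of $f(t_r)$.

```python
import math

def k_approx(f_values):
    """Closed-form approximation: sqrt(pi*n/2) * exp(-S^2 / (2n))."""
    n = len(f_values)
    s = sum(f_values)
    return math.sqrt(math.pi * n / 2.0) * math.exp(-s * s / (2.0 * n))

def k_integral(f_values, half_width=10.0, steps=40000):
    """K from the integral over alpha, by the midpoint rule on a finite range."""
    n = len(f_values)
    s = sum(f_values)
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps):
        a = -half_width + (i + 0.5) * h
        total += math.exp(a * s - 0.5 * n * a * a) / (1.0 + a * a)
    # 1/K = (1/pi) * integral, so K = pi / integral.
    return math.pi / (total * h)

# Invented data: 200 observations whose f-values sum to 20.
f_values = [0.1] * 200
print(k_approx(f_values), k_integral(f_values))
```

The two values differ by a percent or two here, the difference coming mainly from the factor $1/(1+\alpha^2)$ evaluated near the estimate of $\alpha$, which the closed form sets equal to 1.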

5.95. It will be noticed in all these tests that the hypotheses, before they 
are tested, are reduced to laws expressing the probabilities of observable 
events. We distinguish between the law and its suggested explanation,
if there is any; it is perfectly possible for a law to be established
empirically without there being any apparent explanation, and it is
also possible for the same law to have two or three different explana- 
tions. When stellar parallax was first discovered the question was 
whether the measured position of a star relative to stars in neighbouring 
directions showed only random variation or contained a systematic 
part with an annual period, the displacement from some standard 
position being related in a prescribed way to the earth’s position
relative to the sun. This can be stated entirely in terms of the probabilities
of observations, without further reference to the explanation by
means of the possible finite distance of the star. The latter is reduced, 
before the test can be applied, to a suggestion of one new parameter 
that can be tested in the usual way. It happens here that the explana- 
tion existed before the relevant observations did; they were made to 
test a hypothesis. But it might well have happened that study of 
observations themselves revealed an annual variation of position 
between visually neighbouring stars, and then parallax would have 
been established — at first under some other name — and the theoretical 
explanation in terms of distance would have come later. Similarly the
test of whether the universe has a finite curvature is not to be settled by
‘philosophical’ arguments claiming to show that it has or has not, but
by the production of some observable result that would differ in the two
cases. The systematic change of this result due to assuming a finite
radius $R$ would be the function $f(t)$ of a test. Its coefficient would
presumably be proportional to some negative power of $R$, but if a test
should reveal such a term the result is an inductive inference that will 
be useful anyhow; it remains possible that there is some other explana- 
tion that has not been thought of, and there is a definite advantage in 
distinguishing between the result of observation and the explanation. 



VI 

SIGNIFICANCE TESTS: VARIOUS COMPLICATIONS 

‘What’s one and one and one and one and one and one and one and one and
one and one?’

‘I don’t know,’ said Alice, ‘I lost count.’

‘She can’t do addition,’ said the Red Queen.

Lewis Carroll, Through the Looking-Glass.

6.0. Combination of Tests. The problems discussed in the last 
chapter are all similar in a set of respects. There is a clearly stated 
hypothesis q under discussion, and also an alternative q' involving one 
additional adjustable parameter, the possible range of whose values is 
restricted by the values of quantities that have a meaning even if the 
new parameter is not introduced. We are in the position at the outset 
of having no evidence to indicate whether the new parameter is needed, 
beyond the bare fact that it has been suggested as worth investigating; 
but the mere fact that we are seriously considering the possibility that it 
is zero may be associated with a presumption that if it is not zero it is 
probably small. Subject to these conditions we have shown how, with 
enough relevant evidence, high probabilities may be attached on the 
evidence, in some cases to the proposition that the new parameter is 
needed, in others to the proposition that it is not. Now at the start of 
a particular investigation one or more of these conditions may not be 
satisfied, and we have to consider what corrections are needed if they 
are not. 

In the first place, we may have previous information about the values 
permitted on q'. This may occur in two ways. In the problem of the 
bias of dice, we supposed that the chance of a 5 or a 6, if the dice were 
biased, might be anything from 0 to 1. Now it may be said that this 
does not represent the actual state of knowledge, since it was already 
known that the bias is small. In that event we should have overestimated
the permitted range and therefore $K$; the evidence against
$q$ is therefore stronger than the test has shown. Now there is something
in this objection; but we notice that it still implies that the test has
given the right answer, perhaps not as forcibly as it might, but quite
forcibly enough. The difficulty about using previous information of this
kind, however, is that it belongs to the category of imperfectly catalogued
information that will make any quantitative theory of actual
belief impossible until the phenomena of memory themselves become
the subject-matter of a quantitative science; and even if this ever
happens it is possible that the use of such data will be entirely in the
study of memory and not in, for instance, saying whether dice have a 
bias. However, all that we could say from general observation of dice, 
without actually keeping a record, is that all faces have sometimes 
occurred; we could not state the frequency of a 5 or a 6 more closely 
than that it is unlikely to have been under 0.1 or over 0.5. Such information
would be quite useless when the question is whether the chance
is $\tfrac13$ or 0.3377; and it may as well be rejected altogether. Vague information
is never of much use, and it is of no use at all in testing small effects.

The matter becomes clearer on considering the following problem. 
Suppose that we take a sample of n to test an even chance. The approxi- 
mate formula 5.1 (9) is 

$$K = (2n/\pi)^{1/2}\exp(-\tfrac12\chi^2). \tag{1}$$

Now suppose that we have a sample of 1,000 and that the departure
makes $K$ less than 1. If we divide the data into 9 groups and test each
separately the outside factor for each is divided by 3; but at the same
time we multiply all the standard errors by 3 and divide the contribution
to $\chi^2$ from a given genuine departure by 9. Thus a departure that
would be shown by a sample of 1,000 may not be shown by any one 
of its sections. It might be said, therefore, that each section provides 
evidence for an even chance; therefore the whole provides evidence for 
an even chance; and that we have an inconsistency. This arises from 
an insufficient analysis of the alternative $q'$. The hypothesis $q$ is a
definitely stated hypothesis, leading to definite inferences. $q'$ is not,
because it contains an unknown parameter,† which we have denoted
by $p'$, and would be $\tfrac12$ on $q$ but might be anything from 0 to 1 on $q'$.
Anything that alters the prior probability of $p'$ will alter the inferences
given by $q'$. Now the first sub-sample does alter it. We may start with
probability $\tfrac12$ concentrated at $p = \tfrac12$ and the other $\tfrac12$ spread from 0 to 1.
In general the first sub-sample will alter this ratio and may increase
the probability that $p = \tfrac12$, but it also greatly changes the distribution
of the probability of $p'$ given $q'$, which will now be nearly normal
about the sampling ratio with an assigned standard error estimated
from the first sample. It is from this state of things that we start when
we make our second sub-sample, not from a uniform distribution on
$q'$. The permitted range has been cut down, effectively, to something
of the order of the standard error of the sampling ratio given by the
first sample. Consequently the outside factor in (1) is greatly reduced,
and the second sample may give support for $q'$ at a much smaller
value of the estimated $p' - \tfrac12$ than if it started from scratch. We cannot
therefore combine tests by simply multiplying the values of $K$. This
would assume that posterior probabilities are chances, and they are not.
The prior probability when each sub-sample is considered is not the
original prior probability, but the posterior probability left by the
previous one. We could proceed by using the sub-samples in order in
this way, but we already know by 1.5 what the answer must be. The
result of successive applications of the principle of inverse probability
is the same as that of applying it to the whole of the data together,
using the original prior probability, which in this case is the statement
of ignorance. Thus if the principle is applied correctly, the probabilities
being revised at each stage in accordance with the information already
available, the result will be the same as if we applied it directly to the
complete sample; and the answer for this is given by (1). It follows
that the way of combining significance tests is not to multiply the $K$'s,
but to add the values of $n$ in the outside factors and to use a $\chi^2$ based
on the values estimated for $p'$ and its standard error from all the
samples together.

† This distinction appears also in Fisher’s theory: see The Design of Experiments,
1935, p. 19.
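
The apparent inconsistency can be reproduced with a few lines of arithmetic. The sketch below applies the approximate formula $K = (2n/\pi)^{1/2}\exp(-\tfrac12\chi^2)$ to invented data: 900 trials with 500 successes, split into 9 groups of 100. The function name and the counts are hypothetical, and $\chi^2$ is taken as the usual value $(x - n/2)^2/(n/4)$ for a test of an even chance.

```python
import math

def k_even_chance(successes, n):
    """K = sqrt(2n/pi) * exp(-chi2/2) for a test of an even chance."""
    # chi-squared for the departure of the count from n/2.
    chi2 = (successes - n / 2.0) ** 2 / (n / 4.0)
    return math.sqrt(2.0 * n / math.pi) * math.exp(-0.5 * chi2)

# Invented data: successes in each of 9 groups of 100 trials (total 500).
groups = [56, 55, 56, 55, 56, 55, 56, 55, 56]

# Correct combination: pool the counts and test once.
k_whole = k_even_chance(sum(groups), 100 * len(groups))

# Incorrect combination: test each group and multiply the K's.
k_parts = [k_even_chance(x, 100) for x in groups]
k_product = math.prod(k_parts)

print(k_whole, min(k_parts), k_product)
```

Every group taken alone gives $K > 1$, so their product is very large, while the pooled sample gives $K$ well below 1. Pooling the counts is the calculation that matches the rule stated above: add the values of $n$ and form one $\chi^2$ from all the samples together.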

In the dice problem, therefore, the information contained in, say, 
1,000 previous trials, even if they had been accurately recorded, could 
affect the result only through (1) a change in n, which would alter K by 
about 1 part in 600, (2) changes in the estimated $p'$, about which we
are not in a position to say anything except by using Weldon’s sample
itself as our sole data, (3) a reduction of the standard error by 1 in 600.
The one useful thing that the previous experience might contain, the
actual number of successes, is just the one that is not sufficiently
accurately recalled to be of any use. Thus in significance tests, just as
in estimation problems, we have the result that vaguely remembered 
previous experience can at best be treated as a mere suggestion of 
something worth investigating; its effect in the quantitative application 
is utterly negligible. 

Another type of previous information restricting the possible values
of a new parameter, however, is important. This is where the existence
of the new parameter is suggested by some external consideration
that sets limits to its magnitude. A striking illustration of this is the
work of Chapman and his collaborators on the lunar tide in the atmosphere.†
From dynamical considerations it appears that there should
be such a tide, and that it should be associated with a variation of
pressure on the ground, of the order of the load due to a foot of air
or 0.001 inch of mercury. Actual readings of pressure are usually made
to 0.001 inch, which represents the observational error; but the actual
pressure fluctuates in an irregular way over about 3 inches. Now we
saw that the significance test would lead to no evidence whatever about
the genuineness of an effect until the standard error had been reduced
by combining numerous observations to something comparable with
the permitted range, and that it could lead to no decisive result until
it had been made much less than this. The problem was therefore to
utilize enough observations to bring the standard error down from
about an inch of mercury to considerably under 0.001 inch, requiring
apparently about $10^6$ observations. In view of the large fluctuation
present and unavoidable, Chapman rounded off the last figure of the
pressures recorded; but he also restricted himself to those days when
the pressure at Greenwich did not vary more than 0.1 inch, so that the
standard error of one observation is reduced to $0.1/\sqrt{3}$ inch; and
combined hourly values of pressure for those days over 63 years, including
6,457 suitable days. Now $0.1/(3\times 6457\times 24)^{1/2} = 0.00014$. A
definite result should therefore be obtained if there are no further complications.
There might well be, since consecutive hourly values of a
continuous function might be highly correlated and lead to an increase
of uncertainty. Special attention had also to be given to the elimination
of solar effects. The final result was to reveal a lunar semidiurnal
variation with an amplitude of 0.000356 inch, the significance of which
is shown immediately on inspection of the mean values for different
distances of the moon from the meridian.

† M.N.R.A.S. 78, 1918, 635-8; Q.J.R. Met. Soc. 44, 1918, 271-9.
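
The arithmetic of this paragraph can be sketched as follows. The helper names are hypothetical, and the calculation assumes independent observations, which the text itself warns may not hold for consecutive hourly values.

```python
import math

def observations_needed(sd_single, target_se):
    """Independent observations needed for the mean to reach target_se."""
    return (sd_single / target_se) ** 2

def standard_error_of_mean(sd_single, count):
    """Standard error of the mean of `count` independent observations."""
    return sd_single / math.sqrt(count)

# From about 1 inch down to 0.001 inch: roughly a million observations.
n_needed = observations_needed(1.0, 0.001)

# Restricted days: standard error of one observation 0.1/sqrt(3) inch,
# with 24 hourly values on each of 6,457 days.
sd_restricted = 0.1 / math.sqrt(3.0)
se = standard_error_of_mean(sd_restricted, 6457 * 24)

print(n_needed, se)
```

The second figure comes out between 0.00014 and 0.00015 inch, in line with the value quoted, and well under the 0.001 inch at which the expected effect lies.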

In such a case, where the hypothesis q' , that the effect sought is 
not zero, itself suggests a limit to its amount, it would obviously be 
unfair to apply the same test as in the case of complete previous
ignorance of the amount. The range in which the parameter is sought 
is much less and the selection to be allowed for in choosing an estimate 
on q' is much less drastic and therefore requires a smaller allowance. 

These considerations suggest an answer to the question of how signi- 
ficance tests should be combined in general. It often happens that we 
get a series of estimates of a parameter, from different sets of data, 
that all have the same sign and run up in magnitude to about twice 
the standard error. None of them taken by itself would be significant, 
but when they all agree in this way one begins to wonder whether they 
can all be accidental; one such accident, or even two with the same 
sign, might pass, but six may appear too many. We have seen how to
do the combination for the test of a sampling ratio. Similar considerations
will apply to measures, so long as the standard errors of one
observation are the same in all series. If they differ considerably a
modification is needed, since two equal departures with the same
standard error may give different results in a test when one is based
on a few accurate observations and the other on many rough ones.
The outside factor will not be simply $(\pi n/2)^{1/2}$, since what it really
depends on is the ratio of the range of the values initially possible to
the standard error of the result. The former is fixed by the smallest
range indicated and therefore by the most accurate observations, and
the less accurate ones have nothing to say about it. It is only when
they have become numerous enough to give a standard error of the
mean less than the standard error of one observation in the more
accurate series that they have anything important to add. If they
satisfy this condition the outside factor will be got from 5.0(10) by
taking $f(a)$ from the most accurate observations, and $a$ and $s$ from all
the series combined.

These considerations indicate how to adapt the results of the last 
chapter to deal with most of the possible types of departure from the 
conditions considered there. One further possibility is that q and q' 
may not be initially equally probable. Now, in accordance with our 
fundamental principle that the methods must not favour one hypothesis 
rather than another, this can occur only if definite evidence favouring 
q or q' is actually produced. If there is none, they are equally probable. 
If there is, and it is produced, it can be combined with the new informa- 
tion and give a better result than either separately. This difficulty can 
therefore easily be dealt with, in principle. But it requires attention to 
a further point in relation to Bernoulli’s theorem. All the assessments
of prior probabilities used so far have been statements of previous ignorance.
Now can they be used at all stages of knowledge? Clearly not;
in the combination of samples we have already seen that to use the 
same prior probability at all stages, instead of taking information into 
account as we go on, will lead to seriously wrong results. Even in a pure 
estimation problem it would not be strictly correct to find the ratios 
of the posterior probabilities for different ranges of the parameter by 
using sections of the observations separately and then multiplying the 
results, though the difference might not be serious. If we are not to run 
the risk of losing essential information in our possession, we must 
arrange to keep account of the whole of it. This is clear enough in 
specific problems. But do we learn anything from study of one problem
that is relevant to the prior probabilities in a different one? It appears
that we do and must; for if the prior probabilities were fixed for all
problems, since there is no limit to the number of problems that may
arise, the prior probabilities would lead to practical certainty about the
fraction of the times when $q$ will be true, and about the number of times
that a sampling ratio will lie in a definite range. But this would almost
contradict our rule 6, that we cannot say anything with certainty about 
experience from a priori considerations alone. The distinction between 
certainty and the kind of approximation to certainty involved in 
Bernoulli’s theorem makes it impossible to say that this is a definite 
contradiction, but it appears that the statement that even such an 
inference as this can be made in this way is so absurd that an escape 
must be sought. The escape is simply that prior probabilities are not 
permanent; the assessments will not hold at all stages of knowledge, 
their function being merely to show how it can begin. It is a legitimate 
question, therefore, to ask what assessments should replace them in any 
advanced subject, allowing for previous experience in that subject. The 
point has been noticed by Pearson in a passage already quoted (p. 115). 
When melting was first studied quantitatively it would have been right
to attach prior probability $\tfrac12$ (or $\tfrac14$ as suggested in 3.2 (20)) to the proposition
that a given pure substance would have a fixed melting-point, or,
more accurately, that variations of the observed melting-point are
random variations about some fixed value. It would be ridiculous to
do so now. The rule has been established for one substance, and then
for many; then the possibility that it is true for all comes to be seriously
considered, and giving this a prior probability $\tfrac12$ or $\tfrac14$ we get a high
posterior probability that it is true for all; and it is from this situation
that we now proceed.

For the elementary problem of chances, similarly, we may begin with
a finite prior probability that a chance is 0 or 1; but as soon as one
chance is found that is neither 0 nor 1, it leads to a revision of the
estimate and to the further question, ‘Are all chances equal?’ which
a significance test answers in the negative; and then, ‘Do chances show
any significant departure from a uniform distribution?’ Pearson† says
that ‘chances lie between 0 and 1, but our experience does not indicate
any tendency of actual chances to cluster round any particular value
in this range. . . . Those who do not accept the hypothesis of the equal
distribution of ignorance are compelled to produce definite evidence of
the clustering of chances, or to drop all application of past experience

† Phil. Mag. 13, 1907, 366.

to the judgement of probable future statistical ratios. It is perfectly
easy to form new statistical algebras with other clustering of chances.’
Accepting this statement for a moment, the accurate procedure at 
present would be to collect determinations of chances and take the 
prior probabilities of 0, 1, and intermediate values in proportion to 
the observed frequencies. The important point in this passage is the 
recognition that the Bayes-Laplace assessment is not a definite state- 
ment for all time, and that previous information from similar problems 
is relevant to the prior probability. But the statement is incomplete 
because in some subjects chances do cluster. The uniform assessment 
might have been right in genetics at the time of Mendel’s original 
experiment, but a modern Mendelian would be entitled to use the
probabilities indicated by the observed frequencies of 0:1, 1:1, 1:3,
3:5,... ratios in interpreting his results, and in fact does so roughly.
Mendel’s first results rested on about 8,000 observations; some hundreds
would not usually be considered enough, and this corresponds to
the fact that all that is now needed is to establish a high probability
for one ratio compatible with the Mendelian theory against the others
that have previously occurred and a background of other ratios attributable
to differences of viability. Correlations in meteorology seem
to be very evenly distributed, but those between human brothers seem
to collect about +0.5. A chemist wanting the molecular weight of a
new compound would not content himself with a statement of his own 
determination. He carries out a complete analysis, finds one constitu- 
tion consistent with all the data, and if he wants the accurate molecular 
weight for any other purpose he will calculate it from the International 
Table of Atomic Weights. The uncertainty will be that of the calculated 
value, not his own. Thus previous information is habitually used and 
allowed for, and it is not in all subjects that the previous information 
is of the type considered by Pearson in the passage quoted. It is not 
valid to group all estimates of chances or other parameters together to 
derive a revision of the prior probabilities, because the grouping is 
known to be different in different subjects, and this is already allowed 
for in practice, whether explicitly or not, and perhaps more drastically 
than theory would indicate. Thus differences of procedure in different 
subjects are largely explicable in terms of differences in the nature of 
previous results, allowed for in a way equivalent to reassessments of the 
prior probabilities based on previous experience. There is no need to 
assume any difference in the fundamental principles, which themselves 
provide means of making such reassessments. It is, in fact, desirable
that the results of a subject should be analysed at convenient intervals
so as to see whether any alteration will be needed for future use, in 
order that its inferences should represent as accurately as possible the 
knowledge available at the times when they are made. Any subject in 
its development provides the kind of information that is needed to 
bring its prior probabilities up to date. At present, however, we must 
be content with approximations, and in some subjects at any rate there 
seems to be no need for any immediate modification of the assessments 
used to express ignorance. In subjects where statistical methods have 
hitherto had little application they are suitable as they stand. It is 
clear that we cannot revise them in the same way in all subjects; 
experience in genetics is applicable to other problems in genetics, but 
not in earthquake statistics. 

There is one possible objection to reassessment; if it is carried out, 
it will convince the expert or the person willing to believe that we have 
used the whole of the data and done the work correctly. It will not 
convince the beginner anxious to learn; he needs to see how the learning 
was done. We have already had some examples to the point. The data 
on criminality of twins on p. 238 were taken from Fisher’s book, and
quoted by him from Lange. Now both Lange and Fisher already knew
a great deal about like and unlike twins, and it is possible that, on their
data, the question of a significant difference was already answered, and
the only question for them was how large it was, a pure problem of
estimation. But a person that knows of the physical distinction, but
has never thought before that there might be a mental one too, should
be convinced on these data alone by a $K$ of 1/170. Compare with this
the results of the cattle inoculation test, where $K = 0.37$. The odds
on these data that the inoculation is useful are about the same as that
we shall pick a white ball at random out of a bag containing three
white and one black, or that we shall throw a head within the first two
throws with a penny. The proper judgement on these data is, ‘Well,
there seems to be something in it, but I should want a good deal more
evidence to be satisfactorily convinced.’ If we say, ‘Oh, but we have
much more evidence’, he is entitled to say, ‘Why did you not produce it?’
(I may say that in this case I have not the slightest idea what other
evidence exists.) The best inference is always the one that takes account
of the whole of the relevant evidence; but if somebody provides us with
a set of data $\theta_1$ and we take account also of additional information $\theta_2$, we
shall obtain $P(q \mid \theta_1\theta_2 H)$, and if we do not tell him of $\theta_2$, it is not his
fault if he thinks we are giving him $P(q \mid \theta_1 H)$ and confusion arises.
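
The comparisons of odds in this paragraph follow from converting $K$ into a posterior probability. A minimal sketch, assuming that $q$ and $q'$ start with equal prior probabilities so that the posterior odds on $q$ equal $K$ (the function name is hypothetical):

```python
def prob_alternative(k):
    """Posterior probability of q' when q and q' start equally probable."""
    # Posterior odds on q are K, so P(q') = 1 / (1 + K).
    return 1.0 / (1.0 + k)

p_inoculation = prob_alternative(0.37)    # cattle inoculation test
p_twins = prob_alternative(1.0 / 170.0)   # criminality of twins

# The two everyday comparisons used in the text.
p_white_ball = 3.0 / 4.0                  # three white balls, one black
p_head_in_two = 1.0 - 0.5 * 0.5           # a head within the first two throws

print(p_inoculation, p_twins, p_white_ball, p_head_in_two)
```

With $K = 0.37$ the probability of the alternative is a little under three-quarters, matching both comparisons, whereas $K = 1/170$ leaves less than one chance in a hundred for $q$.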



6.1. Several new parameters often arise for consideration simultaneously.
This can happen in several ways. All may be independently
suggested for consideration, and it merely happens that a set of observations
is capable of providing answers to several independent questions,
or even, in experimental work, that it has been convenient to design
an experiment deliberately so as to answer them all. This is merely
a slight extension of the case of one new parameter. Each parameter
can be tested separately against the standard error by the usual rule.
Thus in agricultural experiments the comparisons of the productivities
of two varieties of crop and of the effects of two fertilizers are questions
set at the start, presumably because they are worth asking, and the
answer to one has nothing directly to do with the other.

In such cases we shall need a joint prior probability distribution for
the two new parameters in case they may both be accepted, and consistency
requires a symmetrical method. If the parameters are $\alpha$, $\beta$, we
can write $q$ for the proposition $\alpha = \beta = 0$, $q_\alpha$ for $\alpha \ne 0$, $\beta = 0$, $q_\beta$ for
$\alpha = 0$, $\beta \ne 0$, and $q_{\alpha\beta}$ for $\alpha \ne 0$, $\beta \ne 0$. Then it may appear that if we
test $q_\alpha$ first and then $q_{\alpha\beta}$, we should form $J$ for comparison of these and
use it to give a prior probability distribution for $\beta$ given $\alpha$. But this
leads to an inconsistency. With an obvious notation, it will not in general
be true that

$$d\tan^{-1}J_\alpha^{1/2}\;d\tan^{-1}J_{\beta,\alpha}^{1/2} = d\tan^{-1}J_\beta^{1/2}\;d\tan^{-1}J_{\alpha,\beta}^{1/2},$$

so that we might be led to different results according to which of $\alpha$ and
$\beta$ we tested first. We can obtain symmetry if we take

$$P(d\alpha\,d\beta \mid q_{\alpha\beta}H) = \frac{1}{\pi}\,\frac{dJ_\alpha^{1/2}}{1+J_\alpha}\;\frac{1}{\pi}\,\frac{dJ_\beta^{1/2}}{1+J_\beta}$$

(with the usual modifications if $\alpha$ or $\beta$ cannot range from $-\infty$ to $\infty$).
Thus $\alpha$ and $\beta$ are always compared with the hypothesis that both are
zero.

For reasons already given (5.45) I do not think that this need for
symmetry applies if $\alpha$ is a location parameter and $\beta$ a standard error.

6.11. A common case is where we may have to consider both whether
a new function is needed and whether the standard error needs to be
increased to allow for correlation between the errors. Here two parameters
arise; but the test for the first may well depend on whether we
accept the second. This can be treated as follows. Let $\alpha$ be the coefficient
of the new function, $\rho$ the intraclass correlation between the observations.
Then we have to compare four alternatives, since either $\alpha$ or $\rho$
may be 0 independently. Then let $q$ be the proposition $\alpha = 0$, $\rho = 0$.
$q_\alpha$ is the proposition $\alpha \ne 0$, $\rho = 0$; $q_\rho$ is $\alpha = 0$, $\rho \ne 0$, and $q_{\alpha\rho}$ is $\alpha \ne 0$,
$\rho \ne 0$. Then we can work out as usual

$$K_\alpha = \frac{P(q \mid \theta H)}{P(q_\alpha \mid \theta H)}, \qquad K_\rho = \frac{P(q \mid \theta H)}{P(q_\rho \mid \theta H)}.$$

If these are both $> 1$, $q$ is confirmed in both cases and may be retained.
If one of them is $> 1$ and the other $< 1$, the evidence is for the
alternative that gives the latter, and against $q$. Thus $q$ is disposed of
and we can proceed to consider the fourth possibility. Now

$$\frac{P(q_\alpha \mid \theta H)}{P(q_\rho \mid \theta H)} = \frac{K_\rho}{K_\alpha},$$

and the more probable of the second and third alternatives is the one
with the smaller $K$. The relevance of this parameter may then be
inferred in any case. Suppose that this is $\rho$. Then we have established
internal correlation and the original standard errors are irrelevant to the
test of $q_{\alpha\rho}$ against $q_\rho$. The comparison will therefore be in terms of the
summaries by ranges or classes, not the individual observations; the
standard error found for $\alpha$ will be larger than on $q_\alpha$, and it is possible
that $K_\alpha$ may be less than 1 and yet that the data do not support $\alpha$ when
allowance is made for $\rho$. If, however, $\alpha$ is still supported we can assert
that neither $\alpha$ nor $\rho$ is 0. On the other hand, if $q_\alpha$ is asserted by the first
pair of tests we can still proceed to test $\rho$. Thus a decision between the
four alternatives can always be reached.
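
The order of procedure described here can be written out as a small decision routine. This is only a sketch of the logic: the function names and string labels are hypothetical, the values of $K$ are supplied by the caller, and the handling of exact ties (a $K$ equal to 1, or two equal $K$'s) is an arbitrary choice that the text does not address.

```python
def first_stage(k_alpha, k_rho):
    """Which of q, q_alpha, q_rho the first pair of tests favours."""
    if k_alpha > 1.0 and k_rho > 1.0:
        return "q"  # q confirmed against both alternatives
    # Otherwise assert the alternative with the smaller K.
    return "q_alpha" if k_alpha < k_rho else "q_rho"

def decide(k_alpha, k_rho, k_second):
    """Final choice among the four alternatives.

    k_second is K for the remaining parameter, computed after the first
    one has been accepted (for example, alpha tested with standard errors
    from class summaries once rho is accepted).
    """
    first = first_stage(k_alpha, k_rho)
    if first == "q":
        return "q"
    return "q_alpha_rho" if k_second < 1.0 else first

# Invented values of K for illustration.
print(decide(0.5, 0.01, 2.0))
print(decide(0.5, 0.01, 0.1))
```

In the first call the correlation is accepted and the new function is then not supported once allowance is made for it, even though its own first-stage $K$ was below 1; in the second call both parameters are asserted.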

Referring again to Weldon’s dice experiment, we have an interesting 
illustration. The data as recorded gave the numbers of times when 
the 12 dice thrown at once gave 0, 1, 2,..., 12 fives and sixes. The test
for a departure of the chance from $\tfrac13$ showed that the null hypothesis
must be rejected, but the evidence might conceivably arise from a non-independence
of the chances for dice thrown at the same time. This
was tested by Pearson by computing the expectations of the numbers
of times when 0, 1,... fives and sixes should be thrown with the revised
estimate of the chance, 0.33770, and forming a new $\chi^2$ with them. In
Fisher’s revision,† in which a little grouping has been done, the revised
$\chi^2$ is 8.2 on 9 degrees of freedom, so that independence may be considered
satisfactorily verified and the bias accepted as the explanation
of the observed departure of the sampling ratio from $\tfrac13$.

† Statistical Methods, p. 67.

6.12. Similar considerations will apply in many other cases where
two or more parameters arise at once; there is a best order of procedure,
which is to assert the one that is most strongly supported, reject those
that are denied, and proceed to consider further combinations. The
best way of testing differences from a systematic rule is always to
arrange our work so as to ask and answer one question at a time. Thus
William of Ockham’s rule,† ‘Entities are not to be multiplied without
necessity’ achieves for scientific purposes a precise and practically
applicable form: Variation is random until the contrary is shown; and
new parameters in laws, when they are suggested, must be tested one at a
time unless there is specific reason to the contrary. As examples of specific
reason we have the cases of two earthquake epicentres tested for identity,
where, if there is a difference in latitude, there would ordinarily be
one in longitude too, or of a suggested periodic variation of unknown
phase, where a cosine and sine would enter for consideration together.

This rule for arranging the analysis of the data is of the first im- 
portance. We saw before that progress was possible only by testing 
hypotheses in turn, at each stage treating the outstanding variation 
as random; assuming that progress is possible we are led to the first 
part of the statement, and have developed means for putting it into 
effect, but the second has emerged from the analysis of its own accord. 
It is necessary to a practical development, for if it could be asked that 
an indefinite number of possible changes in a law should be considered 
simultaneously we should never be able to carry out the work at all. 
The charge, ‘you have not considered all possible variations’ is not an
admissible one; the answer is, ‘The onus is on you to produce one.’ The
onus of proof is always on the advocate of the more complicated 
hypothesis. 

6.2. Two new parameters considered simultaneously. There 
are many cases where two parameters enter into a law in such a way 
that it would be practically meaningless to consider one without the 
other. The typical case is that of a periodicity. If it is present it implies 
the need for both a sine and a cosine. If one is needed the other will 
be accepted automatically as giving only a determination of phase. 
There may be cases where more than two parameters enter in such a 
way, as in the analysis of a function of position on a sphere, where all 
the spherical harmonics of the same degree may be taken at once. 
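
Why the sine and cosine go together can be seen from the amplitude and phase form of a periodic term. The symbols $a$, $b$, $R$, $\phi$ below are introduced only for this illustration:

```latex
a\cos t + b\sin t = R\cos(t-\phi), \qquad
R = \sqrt{a^2+b^2}, \qquad
a = R\cos\phi, \quad b = R\sin\phi .
```

A single amplitude $R$ with unknown phase $\phi$ therefore always calls for two coefficients, and neither $a$ nor $b$ alone says whether a periodicity is present.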

† William of Ockham (d. 1349?), known as the Invincible Doctor and the Venerable
Inceptor, was a remarkable man. He proved the reigning Pope guilty of seventy errors
and seven heresies, and apparently died at Munich with so little attendant ceremony that
there is even a doubt about the year. See the C.D.N.B. The above form of the principle,
known as Ockham’s Razor, was first given by John Ponce of Cork in 1639. Ockham and
a number of contemporaries, however, had made equivalent statements. A historical
treatment is given by W. M. Thorburn, Mind, 27, 1918, 345-53.





The simplest possible case would be the location of a point in
rectangular coordinates in two dimensions, where the suggested position
is the origin, and the standard errors of measures in either direction are
equal. If the true coordinates on $q'$ are $\lambda$, $\mu$, we find

$$J = (\lambda^2+\mu^2)/\sigma^2. \tag{1}$$

Our problem is to give a prior probability distribution for $\lambda$,
$\mu$ given $\sigma$. We suppose that for given $\lambda^2+\mu^2$ the
probability is uniformly distributed with regard to direction. Two
suggestions need consideration.

We may take the probability of $J$ to be independent of the number
of new parameters; then the rule for one parameter can be taken over
unchanged. Taking polar coordinates $\rho$, $\phi$ we have then

$$P(d\rho \mid q'\sigma H) = P(dJ \mid q'\sigma H) = \frac{2}{\pi}\,\frac{dJ^{1/2}}{1+J} = \frac{2}{\pi}\,\frac{\sigma\,d\rho}{\sigma^2+\rho^2}, \tag{2}$$

$$P(d\lambda\,d\mu \mid q'\sigma H) = \frac{1}{\pi^2}\,\frac{\sigma\,d\rho\,d\phi}{\sigma^2+\rho^2} = \frac{1}{\pi^2}\,\frac{\sigma\,d\lambda\,d\mu}{\rho(\sigma^2+\rho^2)}, \tag{3}$$

since $\rho$ can range from 0 to $\infty$. Integrating with regard to $\mu$ we find

$$P(d\lambda \mid q'\sigma H) = \frac{d\lambda}{\pi^2\sqrt{\sigma^2+\lambda^2}}\,\log\frac{\sqrt{\sigma^2+\lambda^2}+\sigma}{\sqrt{\sigma^2+\lambda^2}-\sigma}. \tag{4}$$
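
The step from (3) to (4) can be checked numerically. The sketch below assumes that (3) reads $\sigma\,d\lambda\,d\mu/\{\pi^2\rho(\sigma^2+\rho^2)\}$ and that (4) is the logarithmic form $\log\{(\sqrt{\sigma^2+\lambda^2}+\sigma)/(\sqrt{\sigma^2+\lambda^2}-\sigma)\}/(\pi^2\sqrt{\sigma^2+\lambda^2})$; the function names and the substitution $\mu = |\lambda|\sinh t$ are choices made here for the illustration.

```python
import math


def prior_density(lam: float, mu: float, sigma: float) -> float:
    """Joint prior density of (3): sigma / (pi^2 * rho * (sigma^2 + rho^2))."""
    rho = math.hypot(lam, mu)
    return sigma / (math.pi ** 2 * rho * (sigma ** 2 + rho ** 2))


def marginal_closed_form(lam: float, sigma: float) -> float:
    """Marginal density of lambda in the logarithmic form of (4)."""
    root = math.sqrt(sigma ** 2 + lam ** 2)
    return math.log((root + sigma) / (root - sigma)) / (math.pi ** 2 * root)


def marginal_numeric(lam: float, sigma: float,
                     steps: int = 20000, t_max: float = 40.0) -> float:
    """Integrate (3) over mu with mu = |lam| sinh(t), midpoint rule in t."""
    h = 2.0 * t_max / steps
    total = 0.0
    for i in range(steps):
        t = -t_max + (i + 0.5) * h
        mu = abs(lam) * math.sinh(t)
        jacobian = abs(lam) * math.cosh(t)  # d(mu)/dt
        total += prior_density(lam, mu, sigma) * jacobian * h
    return total
```

With `lam = 0.5` and `sigma = 1.0` the two functions should agree closely, which is a useful guard against slips in the constants of (4).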


Alternatively we might use such a function of $J$ that the prior
probability of $\lambda$ or $\mu$ separately would be the same as for the
introduction of one new parameter. Such a function would be

$$P(d\lambda\,d\mu \mid q'\sigma H) = \frac{\sigma\,d\lambda\,d\mu}{2\pi(\sigma^2+\rho^2)^{3/2}}. \tag{5}$$

This would lead to the consequence that the outside factor in $K$, for $n$
observations, would be $O(n)$. This is unsatisfactory. At the worst we
could test the estimate of $\lambda$ or $\mu$, whichever is larger, for
significance as for one new parameter and allow for selection by
multiplying $K$ by 2, and the outside factor would still be of order
$n^{1/2}$. This would sacrifice some information, but the result should be
of the right order of magnitude.

To put the matter in another way, we notice that, if $\lambda/\sigma$ is
small, (3) and (4) lead to a presumption that $\mu$ is small too, on
account of the factor $1/\rho$. This is entirely reasonable. If we were
simply given a value of $\lambda$ with no information about $\mu$ except
that the probability is uniformly distributed with regard to direction we
should have a Cauchy law for $\mu$:

$$P(d\mu \mid \lambda H) = \frac{|\lambda|\,d\mu}{\pi(\lambda^2+\mu^2)}.$$

But with (5), even if $\lambda/\sigma$ is small, $P(d\mu \mid q'\sigma\lambda H)$
still has a scale factor of order $\sigma$. That is, (3) provides a means
of saying that if $\lambda/\sigma$ is found small in an actual
investigation, and we are equally prepared for any value of $\phi$, then
$\mu/\sigma$ is likely to be small also. (5) provides no such means.

The acceptance of (3) and therefore (4) leads to a curious consequence,
namely that if measures are available of only one of $\lambda$, $\mu$ the
prior probability distribution for that one is appreciably different from
the one we used for a single new parameter. But I think that the arguments
in their favour are much stronger than this one.

We therefore adopt (3). Each observation is supposed to consist of
a pair of measures $x_r$, $y_r$ referring to $\lambda$, $\mu$; we write the
means as $\bar x$, $\bar y$ and put

$$2ns'^2 = \sum (x_r-\bar x)^2 + \sum (y_r-\bar y)^2. \tag{6}$$

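The analysis proceeds as follows.

$$P(q\,d\sigma \mid H) \propto \frac{d\sigma}{\sigma}, \qquad P(q'\,d\sigma\,d\lambda\,d\mu \mid H) \propto \frac{d\sigma\,d\lambda\,d\mu}{\pi^2\rho(\sigma^2+\rho^2)}, \tag{7}$$

whence

$$P(q\,d\sigma \mid \theta H) \propto \frac{1}{\sigma^{2n}}\exp\left\{-\frac{2ns'^2+n(\bar x^2+\bar y^2)}{2\sigma^2}\right\}\frac{d\sigma}{\sigma}, \tag{8}$$

$$P(q'\,d\sigma\,d\lambda\,d\mu \mid \theta H) \propto \frac{1}{\sigma^{2n}}\exp\left\{-\frac{2ns'^2+n(\lambda-\bar x)^2+n(\mu-\bar y)^2}{2\sigma^2}\right\}\frac{d\sigma\,d\lambda\,d\mu}{\pi^2\rho(\sigma^2+\rho^2)}. \tag{9}$$

We are most interested in values of $\bar x$, $\bar y$ appreciably greater
than their standard errors, which will be about $s'/\sqrt n$, and then we
can integrate (9) approximately with regard to $\lambda$ and $\mu$ and
substitute $\bar x$, $\bar y$ for them in factors raised to low powers. Then

$$P(q'\,d\sigma \mid \theta H) \propto \frac{2}{n\pi}\,\frac{1}{\sigma^{2n-2}}\exp\left(-\frac{ns'^2}{\sigma^2}\right)\frac{d\sigma}{\sqrt{\bar x^2+\bar y^2}\,(\sigma^2+\bar x^2+\bar y^2)}. \tag{10}$$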


$n$ is already assumed fairly large. Form a generalized $t^2$, such that

$$\bar x^2+\bar y^2 = \frac{s'^2 t^2}{n-1},$$

since the number of degrees of freedom is $2n-2$. Then

valid if $t$ is more than about 2.


( 13 ) 



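It is possible to reduce $1/K$ exactly to a single integral. We have

$$\int_0^{2\pi} e^{z\cos\phi}\,d\phi = 2\pi\left(1+\sum_{m=1}^{\infty}\frac{z^{2m}}{(2m)!}\,\frac{(2m-1)(2m-3)\cdots 1}{2m(2m-2)\cdots 2}\right) = 2\pi\left(1+\sum_{m=1}^{\infty}\frac{z^{2m}}{2^{2m}\,m!\,m!}\right). \tag{14}$$

Put

$$\bar x^2+\bar y^2 = r^2; \tag{15}$$

then

$$P(q'\,d\sigma\,d\rho \mid \theta H) \propto \frac{2}{\pi}\,\frac{1}{\sigma^{2n}}\exp\left\{-\frac{n(2s'^2+r^2)}{2\sigma^2}\right\}\exp\left(-\frac{n\rho^2}{2\sigma^2}\right)\left\{1+\sum_{m=1}^{\infty}\left(\frac{nr\rho}{2\sigma^2}\right)^{2m}\frac{1}{m!\,m!}\right\}\frac{d\sigma\,d\rho}{\sigma^2+\rho^2}. \tag{16}$$

Put now

$$\rho = \sigma v. \tag{17}$$

Then

$$\frac{1}{K} = \frac{2}{\pi}\int_0^\infty \exp(-\tfrac12 nv^2)\left\{1+\sum_{m=1}^{\infty}\frac{n(n+1)\cdots(n+m-1)}{m!\,m!}\left(\frac{nr^2v^2}{2(2s'^2+r^2)}\right)^{m}\right\}\frac{dv}{1+v^2} \tag{18}$$

$$= \frac{2}{\pi}\int_0^\infty \exp(-\tfrac12 nv^2)\;{}_1F_1\!\left(n,\,1,\,\frac{nr^2v^2}{2(2s'^2+r^2)}\right)\frac{dv}{1+v^2} \tag{19}$$

$$= \frac{2}{\pi}\int_0^\infty \exp\left(-\frac{ns'^2v^2}{2s'^2+r^2}\right)\;{}_1F_1\!\left(1-n,\,1,\,-\frac{nr^2v^2}{2(2s'^2+r^2)}\right)\frac{dv}{1+v^2}. \tag{20}$$

If $n = 1$, $r = 0$, $s'$ is identically 0 and $K$ reduces to 1 as we should
expect. If $n = 0$ it is obvious from (19) that $K = 1$. The resemblance
to 5.2 (33) is very close.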

If several old parameters have to be determined their effect is similar
to that found in 5.92; $n$ will still appear in the outside factor but will
be replaced by $\nu$ in the $t$ factor, but in practice it will be
sufficiently accurate to use $\nu$ in both factors.





If there is a predicted standard error we shall have 

K ~ ^«‘/s77exp(— (x > 2). (21) 

This will be applicable, in particular, when the data are frequencies.

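6.21. Now consider the fitting of a pair of harmonics to a set of $n$
measures of equal standard error. The law to be considered is, for the
$r$th observation,

$$P(dx_r \mid \alpha,\beta,\sigma,H) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{1}{2\sigma^2}(x_r-\alpha\cos t_r-\beta\sin t_r)^2\right\}dx_r, \tag{22}$$

and for comparison with the law for $\alpha = \beta = 0$

$$J_r = (\alpha\cos t_r+\beta\sin t_r)^2/\sigma^2. \tag{23}$$

For $n$ observations we take the mean, namely

$$J = (A\alpha^2+2H\alpha\beta+B\beta^2)/\sigma^2, \tag{24}$$

where

$$nA = \sum\cos^2 t_r, \qquad nH = \sum\cos t_r\sin t_r, \qquad nB = \sum\sin^2 t_r. \tag{25}$$

In practice the phase of the variation is usually initially unknown
and the distribution of the observed values $t_r$ is irrelevant to it. We
need a prior probability rule for the amplitude independent of the phase.
This is obtained if we take the mean of $J$ for variations of phase $\phi$
with $\rho$ kept constant, where $\alpha = \rho\cos\phi$, $\beta = \rho\sin\phi$; then

$$\bar J = \frac{1}{2\pi}\int_0^{2\pi} J\,d\phi = \tfrac12(A+B)(\alpha^2+\beta^2)/\sigma^2 = \tfrac12(A+B)\rho^2/\sigma^2, \tag{26}$$

$$P(d\alpha\,d\beta \mid q'\sigma H) = \frac{2}{\pi}\,\frac{d\bar J^{1/2}}{1+\bar J}\,\frac{d\phi}{2\pi} \tag{27}$$

$$= \frac{(A+B)^{1/2}}{\pi^2\sqrt2}\,\frac{\sigma\,d\alpha\,d\beta}{\sqrt{\alpha^2+\beta^2}\,\{\sigma^2+\tfrac12(A+B)(\alpha^2+\beta^2)\}}. \tag{28}$$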
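
The averaging over phase in (26) is easily verified for any set of observation phases. The sketch below takes (25) as the definition of $A$, $H$, $B$ and (24) as that of $J$; the function names and the sample phases used with it are illustrative only.

```python
import math


def normal_coefficients(times):
    """A, H, B of (25): means of cos^2, cos*sin and sin^2 over the phases t_r."""
    n = len(times)
    A = sum(math.cos(t) ** 2 for t in times) / n
    H = sum(math.cos(t) * math.sin(t) for t in times) / n
    B = sum(math.sin(t) ** 2 for t in times) / n
    return A, H, B


def J(alpha, beta, sigma, A, H, B):
    """J of (24) for given coefficients alpha, beta."""
    return (A * alpha ** 2 + 2 * H * alpha * beta + B * beta ** 2) / sigma ** 2


def phase_mean_J(rho, sigma, A, H, B, steps=360):
    """Mean of J over the phase phi at fixed amplitude rho."""
    total = 0.0
    for k in range(steps):
        phi = 2 * math.pi * k / steps
        total += J(rho * math.cos(phi), rho * math.sin(phi), sigma, A, H, B)
    return total / steps
```

Whatever the phases $t_r$, the mean should equal $\tfrac12(A+B)\rho^2/\sigma^2$; since $A + B = 1$ identically, this is simply $\rho^2/2\sigma^2$.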


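We now find

$$P(q\,d\sigma \mid \theta H) \propto \sigma^{-n}\exp\left\{-\frac{n}{2\sigma^2}(s'^2+Aa^2+2Hab+Bb^2)\right\}\frac{d\sigma}{\sigma},$$

$$P(q'\,d\sigma\,d\alpha\,d\beta \mid \theta H) \propto \sigma^{-n}\exp\left[-\frac{n}{2\sigma^2}\{s'^2+A(\alpha-a)^2+2H(\alpha-a)(\beta-b)+B(\beta-b)^2\}\right]\times\frac{(A+B)^{1/2}}{\pi^2\sqrt2}\,\frac{d\sigma\,d\alpha\,d\beta}{\sqrt{\alpha^2+\beta^2}\,\{\sigma^2+\tfrac12(A+B)(\alpha^2+\beta^2)\}}, \tag{29}$$

where $a$, $b$ are the maximum likelihood estimates of $\alpha$ and $\beta$.
If $\sqrt{a^2+b^2}$ is much greater than $s'/\sqrt n$, which is the
important case, we can integrate approximately with regard to $\alpha$ and
$\beta$. Then

$$P(q'\,d\sigma \mid \theta H) \propto \sigma^{-n}\exp\left(-\frac{ns'^2}{2\sigma^2}\right)\times\frac{\sqrt2}{n\pi}\left(\frac{A+B}{AB-H^2}\right)^{1/2}\frac{\sigma^2\,d\sigma}{\sqrt{a^2+b^2}\,\{\sigma^2+\tfrac12(A+B)(a^2+b^2)\}}. \tag{30}$$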




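Finally, integrating with regard to $\sigma$, and putting $\sigma = s'$ in
factors raised to low powers,

$$\frac{1}{K} \approx \frac{\sqrt2}{n\pi}\left(\frac{A+B}{AB-H^2}\right)^{1/2}\left(1+\frac{Aa^2+2Hab+Bb^2}{s'^2}\right)^{n/2}\frac{s'}{\sqrt{a^2+b^2}\,\{1+(A+B)(a^2+b^2)/2s'^2\}}. \tag{31}$$

Now in finding the least squares solution we get

$$\alpha+\frac{H}{A}\beta = a+\frac{H}{A}b \pm \frac{s}{\sqrt{nA}}, \qquad \beta = b \pm \frac{s}{\sqrt{n(B-H^2/A)}}, \tag{32}$$

with

$$\nu = n-2, \qquad \nu s^2 = ns'^2; \qquad s/\sqrt n = s'/\sqrt\nu; \tag{33}$$

$$\frac{Aa^2+2Hab+Bb^2}{s'^2} = \frac{A(a+Hb/A)^2+(B-H^2/A)b^2}{s'^2} = \frac{1}{\nu}\left\{\frac{(a+Hb/A)^2}{s^2_{a+Hb/A}}+\frac{b^2}{s_b^2}\right\} = \frac{t^2}{\nu}. \tag{34}$$

Then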

j^_r^(^H^IA\y^^(a^+b^)(^ , {A + B)(a^+b^)] 

^2\i+B/a) a + 

(35) 

The calculation is simplified by the fact that $nA$, $nH$, $nB$ are
coefficients in the normal equations and $n(B-H^2/A)$ is the coefficient of
$\beta$ after $\alpha$ has been eliminated. Hence $t^2$ is found directly
from the quantities that occur in the solution and their estimated
standard errors.

If $A = B$, $H = 0$, which is a fairly common case,


1 /a* , b^\ 2 

«• »!/ MA+B)‘ • 

(36) 


(37) 


The approximations have supposed that $\sqrt{a^2+b^2}$ is large compared
with $s/\sqrt n$ and small compared with $\sigma$. But a further
approximation has been made, as we can see by considering the case where
$H = 0$ and $A$ is much larger than $B$. If $nB$ is of order 1, and $a$ is
much less than $s$, the variation of the exponential factor with $\beta$
may be less rapid than that of the factor $(\alpha^2+\beta^2)^{-1/2}$. In
this case all the values of $t_r$ are near 0 or $\pi$. Integrating with
regard to $\beta$ in these conditions we have
P{(l' dadix\eH) 


az 




‘) + cr 


dadoL 


^)—o a^{a^+\ix^) 
^{o^-\-\a^)-\rO da 


P(q' da 1 BH) oc exp| -|^^)log 


tiy 

log(8s7®^)\ ' ^1 


, (38) 
, (39) 

(40) 

(41) 


if a/s is small. 

The danger signal is $s_b > a > s_a$. If $n$ is large and $a/s$ small of
order $n^{-1/2}$, (41) may be smaller than the value given by the direct
test for one unknown. The smaller value of $K$ represents the fact that $J$
might actually be large but that $a$ might be small owing to the
observations happening to lie near $\sin t = 0$. We may be able to assert
with some confidence that a periodic variation is present while knowing
nothing about the coefficient $\beta$ except that it is probably of the
same order of magnitude as $a$, but might be of the order of $\sigma$. The
situation will of course be very unsatisfactory, but we shall have done as
much as we can with the data available. The next step would be to seek for
observations for such other values of $t$ that a useful estimate of $\beta$
can also be found, and then to apply (33).

In the case $A = B$, $H = 0$, we can reduce $1/K$ again to a single
integral. The analysis is similar to that leading to 6.2 (20) and gives

$$\frac{1}{K} = \frac{2}{\pi}\int_0^\infty \exp\left(-\frac{ns'^2v^2}{2s'^2+a^2+b^2}\right)\;{}_1F_1\!\left(1-\tfrac12 n,\,1,\,-\frac{n(a^2+b^2)v^2}{2(2s'^2+a^2+b^2)}\right)\frac{dv}{1+v^2}. \tag{42}$$

In the conditions stated $n = 1$ is impossible. If $n = 0$, $K = 1$. If
$n = 2$, $s' = 0$, and again $K = 1$. This is the case where there are two
observations a quarter of a period apart. The result is identical with
6.2 (20) except that $1-\tfrac12 n$ replaces $1-n$ in the confluent
hypergeometric function and $a^2+b^2$ replaces $r^2$.

There are problems where theory suggests a harmonic disturbance such
as a forced oscillation with a predicted phase. We are then really testing
the introduction of one new function, not two, and the rule of 5.9 applies.
If the disturbance is found we can still test a displacement of phase,
due for instance to damping, by a further application of 5.9 and 5.92.
Here the cosine and sine no longer enter on an equal footing because
previous considerations do not make all phases equally probable on $q'$.

6.22. Test of whether two laws contain a sine and cosine with
the same coefficients. This problem stands to the last in the same
relation as that of 5.41 to 5.2; I shall not develop the argument in detail
but proceed by analogy. A $J$ must be defined for the difference of the
two laws. It is clear that integration with regard to the differences of
$\alpha$ and $\beta$ will bring in a factor $n_1n_2/(n_1+n_2)$ instead of
$n$, and that the square root of this factor can be absorbed into the
second factor, so that the first two factors in (35) will be replaced by

where $A'$, $B'$, $H'$ are coefficients in the equations used to find the
differences $\alpha_2-\alpha_1$, $\beta_2-\beta_1$.

The determination of the corrections to a trial position of an
earthquake epicentre is essentially that of determining the variation of
the residuals in the times of arrival of a wave with respect to azimuth. It
was found in a study of southern earthquakes† (for a different purpose)
that a few pairs gave epicentres close enough to suggest identity, though
they were too far apart in time for the second to be regarded as an
aftershock in the usual sense. On the other hand, cases of repetition
after long intervals are known, and a test of identity would be relevant
to a question of whether epicentres migrate over an area. The case
chosen is that of the earthquakes of 1931 February 10 and 1931
September 25. If $x$ and $y$ denote the angular displacements needed by
the epicentre to the south and east, the trial epicentre being 6.3° S.,
102.5° E., the equations found after elimination of the time of occurrence
from the normal equations were, for 1931 February 10,

$$459x+267y = +33, \qquad 267x+694y = -11.$$

Number of observations 30; sum of squares of residuals 108 sec.²;
solution $x = +0.10° \pm 0.10°$, $y = -0.06° \pm 0.08°$.

For 1931 September 25,

$$544x+163y = -36, \qquad 163x+625y = +94.$$

Number of observations 35; sum of squares 202 sec.²; solution
$x = -0.12° \pm 0.10°$, $y = +0.18° \pm 0.10°$.

† Geophys. Suppl. 4, 1938, 286.


The estimated standard errors of one observation are 2.0 sec. and 2.5
sec., which may be consistent and will be assumed to be. Three parameters
have been estimated for each earthquake (cf. 3.52) and hence
the number of degrees of freedom is 30+35-6 = 59. Then

$$s^2 = (108+202)/59 = 5.25; \qquad s = 2.3\ \text{sec.}$$

The question is whether the solutions indicate different values of $x$ and
$y$ for the two earthquakes. It is best not simply to subtract the solutions
because the normal equations are not orthogonal and the uncertainties
of $x$ and $y$ are not independent. The null hypothesis is that of identity;
if it was adopted we should find $x$ and $y$ by adding corresponding normal
equations and solving. But if there is a difference we can proceed by
using suffixes 1 and 2 for the two earthquakes and writing

$$x_2 = x_1+x', \qquad y_2 = y_1+y'.$$

Then $x'$ and $y'$ are the new parameters whose relevance is in question.
Now we notice that both sets of normal equations can be regarded as
derived from a single quadratic form

$$\begin{aligned}
W ={}& \tfrac12\cdot 459x_1^2+267x_1y_1+\tfrac12\cdot 694y_1^2-33x_1+11y_1\\
&+\tfrac12\cdot 544(x_1+x')^2+163(x_1+x')(y_1+y')+\tfrac12\cdot 625(y_1+y')^2\\
&+36(x_1+x')-94(y_1+y'),
\end{aligned}$$

which leads to normal equations as follows:

$$\begin{aligned}
1003x_1+544x'+430y_1+163y' &= -3,\\
544x_1+544x'+163y_1+163y' &= -36,\\
430x_1+163x'+1319y_1+625y' &= +83,\\
163x_1+163x'+625y_1+625y' &= +94.
\end{aligned}$$

Eliminating $x_1$ and $y_1$ we get

$$\begin{aligned}
245x'+108y' &= -29,\\
108x'+327y' &= +53,\\
279y' &= +66,
\end{aligned}$$

whence the solutions can be taken as

$$x'+0.44y' = -0.12, \qquad y' = +0.24,$$

the uncertainties being independent. Then

$$t^2 = \frac{245\times 0.12^2+279\times 0.24^2}{5.25} = 3.73; \qquad x' = -0.12-0.44\times 0.24 = -0.23,$$


/35x30\Vaj 

f 279 ^ 

i'/3,/(0-0534-0-058)/. 

3.73^-286 

!\ 65 / 1 

^14-1-3^ 

1 2-3 \ 

' 59 / ’ 


= 2 - 2 . 


nearly, 




The odds on the data are therefore about 2 to 1 that the epicentres were 
the same. The further procedure if more accuracy was required would 
be to drop x’ and y' in the normal equations for x and y, and solve for 
the latter by the usual method, revising the residuals to take account 
of the fact that the solution will not be at the least squares solution for 
either separately. 
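
The elimination in this example can be reproduced mechanically. The sketch below applies plain Gaussian elimination to the four normal equations, with the unknowns ordered $x_1$, $y_1$, $x'$, $y'$; the helper names are illustrative, and the coefficients are those printed above. Because it carries unrounded intermediate values, it gives $t^2$ slightly below the 3.73 obtained from the rounded figures.

```python
def eliminate(rows, k):
    """Eliminate the first k unknowns from an augmented linear system."""
    rows = [list(r) for r in rows]
    for p in range(k):
        for i in range(p + 1, len(rows)):
            f = rows[i][p] / rows[p][p]
            rows[i] = [ri - f * rp for ri, rp in zip(rows[i], rows[p])]
    return [r[k:] for r in rows[k:]]


# Normal equations in the order x1, y1, x', y' (last entry: right-hand side).
system = [
    [1003.0, 430.0, 544.0, 163.0, -3.0],
    [430.0, 1319.0, 163.0, 625.0, 83.0],
    [544.0, 163.0, 544.0, 163.0, -36.0],
    [163.0, 625.0, 163.0, 625.0, 94.0],
]

reduced = eliminate(system, 2)           # equations for x', y' alone
(a11, a12, b1), (_, a22, b2) = reduced

ratio = a12 / a11                        # about 0.44
pivot_y = a22 - a12 * ratio              # about 279
y_diff = (b2 - b1 * ratio) / pivot_y     # y'
x_diff = (b1 - a12 * y_diff) / a11       # x'

s2 = (108.0 + 202.0) / 59.0              # pooled variance of one observation
t2 = (a11 * (x_diff + ratio * y_diff) ** 2 + pivot_y * y_diff ** 2) / s2
```

The two terms of `t2` correspond to the independent quantities $x'+0.44y'$ and $y'$ of the text.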

The following attempt to test the annual periodicity of earthquakes
is an instance of the necessity to make a thorough test of the
independence of the errors before the significance of a systematic
variation is established. The numbers of determinations of epicentres of
earthquakes, month by month, made for the International Seismological
Summary for the years 1918-33 were kindly supplied to me by Miss E. F.
Bellamy. These do not represent the whole number of earthquakes
listed; small shocks observed at only a few stations are given only in
daily lists, but the list should be representative of the large and
moderate shocks, for which all the observations are given in detail. As
the months are unequal in length a systematic effect was first allowed
for by dividing each monthly total by the ratio of the length of the
month to the mean month. The resulting values were rounded to a
unit, and are as follows.



        Jan.  Feb.  Mar.  Apr.  May   June  July  Aug.  Sept. Oct.  Nov.  Dec.  Total
1918     24    40    24    27    23    34    24    36    53    30    26    31     372
1919     18    17    23    18    30    22    43    37    55    33    13    12     321
1920     33    30    17    14    32    36    24    17    58    24    21    24     336
1921     22    16    24    17    32    20    19    16    27    26    23    16     258
1922     22    23    19    32    26    31    23    32    32    17          31     310
1923     20    36    26    23    39    38    51    45   142    44    60    30     644
1924     34    24    46    38    45    24    42    31    84    28    34    36     466
1925     36    50    36    36    54    56    49    39    32    28    26    36     478
1926     28    27    45    29    28    55    62   114    66    75    44    66     609
1927     42    47    57    49    82    48    60    64    51    66    57    40     663
1928     36    42    62    74    61    54    41    67    41    33    38    50     599
1929     43    41    67    63    61    66    62    61    36    44    28    39     601
1930     24    37    57    44    83    41    68    40    67    80    66    66     653
1931     61    39    50    56    62    38    64    72    67    63    36    42     630
1932     36    42    42    40    50    87    43    39    47    41    40    61     568
1933     39    54    76    52    60    69    73    42    53    47    43    33     640
Total   618   670   670   612   758   719   728   742   891   669   667   603   8,047


There is on the whole a secular increase in the number per year,
which is mostly due to the increase in the number of stations, many
earthquakes in the first few years of the period having been presumably
missed or recorded so poorly that no epicentre could be determined. We
first compute $\chi^2$ to test proportionality in the chances. It is found
to be 707 on 165 degrees of freedom! No test of significance is needed.
There are four contributions of over 20; 109 for September 1923, 60 for
August 1926, 25 for June 1932, and 21 for September 1924. Even apart
from these extreme cases, $\chi^2$ remains overwhelmingly large. The only
years that give anything near the normal expectation are 1921, with
12.0, and 1922, with 13.4. The immediate result is that the hypothesis
of independence is seriously wrong; the test has eliminated any
periodicity in a year or any submultiple, and any secular change. The
obvious explanation is that on an average earthquakes occur in groups of
4.3, not as separate occurrences. The enormous number in September 1923
represent aftershocks of the great Tokyo earthquake. It would be of
little use to reject the years containing the very exceptional months,
because the phenomenon is present, to a greater or less extent, in nearly
every year.
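
The figures in this paragraph hang together as simple arithmetic. The sketch below assumes the table is treated as 16 years by 12 months with row and column proportions fitted, and reads the ratio of $\chi^2$ to its degrees of freedom as the mean group size, which is the interpretation given in the text; the variable names are illustrative.

```python
import math

years, months = 16, 12
cells = years * months              # 192 monthly counts
fitted = years + months - 1         # 27 parameters fitted to rows and columns
dof = cells - fitted                # the same as (years - 1) * (months - 1)

chi2 = 707.0                        # value quoted in the text
dispersion = chi2 / dof             # mean group size implied by the excess
inflation = math.sqrt(dispersion)   # factor applied to standard errors
```

This gives 165 degrees of freedom, a ratio of about 4.3, and a factor of about 2.1 for the standard errors.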

If the residual variation from month to month was independent we
might still proceed to determine a pair of Fourier coefficients, allowing
for the departure from independence within a month by simply
multiplying the standard error by $4.3^{1/2} = 2.1$. But inspection of the
signs of the residuals shows that they are not independent. We can test the
number of persistences and changes of sign against an even chance;
but there are many small residuals and a slight oscillation among them
gives numerous changes of sign and reduces the sensitiveness of the test
greatly. We can recover some of the information lost in this treatment
by considering only residuals over ±7, thus paying some attention to
magnitude as well as to sign. There are 55 persistences and 34 changes,
which, tested against the formula for an even chance, give $K = 0.7$. But
the elimination of 27 parameters has introduced 27 changes of sign, and to
allow for this we must reduce the number of changes by about 13. With
this modification $K$ is 0.003. Thus the lack of independence extends over
more than one month, and the standard error found on this hypothesis
must be multiplied by more than 2.1. The only hope is to make separate
analyses for each year and examine their consistency. If $\theta$ denotes
the phase for an annual period, measured from January 16, we get the
following results for the coefficients of $\cos\theta$ and $\sin\theta$ in
the monthly numbers.



         cos     sin               cos     sin
1918    -2.0    -4.8      1926   -15.8   -18.8
1919   -13.2    -6.3      1927    -8.2    +0.8
1920    -2.0    -3.6      1928    -6.0   +11.3
1921    -1.0    +0.2      1929    -8.7   +13.8
1922    -3.2    -0.2      1930    -3.7    -6.7
1923   -16.2   -21.8      1931    -7.0    -2.6
1924    -4.7    -0.7      1932    -6.0    +3.2
1925    -6.6    +8.6      1933    -8.7   +10.5
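
The values of K quoted for the test of persistences and changes of sign, before the table, can be approximately recovered. The sketch assumes that the formula for an even chance is $K = (n+1)!/(x!\,y!\,2^n)$ with $n = x+y$, that is, the comparison of a chance of exactly one half with a chance uniformly distributed between 0 and 1; this identification and the function name are assumptions made here.

```python
import math


def k_even_chance(x: int, y: int) -> float:
    """K for an even chance against a uniformly distributed chance.

    K = (n + 1)! / (x! y! 2^n), n = x + y, evaluated through log-gamma
    so that large factorials do not overflow.
    """
    n = x + y
    log_k = (math.lgamma(n + 2) - math.lgamma(x + 1) - math.lgamma(y + 1)
             - n * math.log(2.0))
    return math.exp(log_k)


k_raw = k_even_chance(55, 34)            # 55 persistences, 34 changes
k_adjusted = k_even_chance(55, 34 - 13)  # changes reduced by about 13
```

On this assumption `k_raw` is a little over 0.6, close to the 0.7 quoted, and `k_adjusted` is about 0.003.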




Simple means of the coefficients, with separate determinations of the
standard errors, give

$$(-6.9 \pm 1.2)\cos\theta-(0.9 \pm 2.4)\sin\theta.$$

But it is very hard to see how to account for the much greater variation
of the separate values for the sine than for the cosine coefficient. If we
pool the two variations to get a general uncertainty the standard errors
of both coefficients are 1.9, and $t^2 = 13.3$. $K$ is about 0.2. This is
small enough for us to say that there is substantial evidence for a
periodicity, but it is not decisive. It remains possible, in fact, that
a few long series of aftershocks in the summer months are responsible,
in spite of the consistently negative signs of the coefficients of the cosine;
though the odds are about 4 to 1 against the suggestion.

Harmonic analysis applied to the monthly totals for the whole period
gives terms $(-110.9 \pm 10.6)\cos\theta-(18.5 \pm 10.6)\sin\theta$ on the
hypothesis of independence. The standard error is $(n/72)^{1/2}$, where $n$
is the number of observations. Thus for one year the terms would be

$$(-6.9 \pm 0.66)\cos\theta-(1.2 \pm 0.66)\sin\theta.$$

But we know from $\chi^2$ that the uncertainties must be multiplied by at
least $4.3^{1/2}$, giving 1.37. The correlation between adjacent months is
responsible for the rest of the increase. If it had not been for the check
on independence the above determinations might have been accepted
without a moment’s hesitation; as it is, they may perhaps be accepted,
but certainly with hesitation.
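
The standard errors quoted in the last two paragraphs can be checked as follows. The sketch assumes that the pooled uncertainty is the root mean square of the two separate standard errors and that the generalized $t^2$ is the sum of squares of the two coefficients divided by the pooled variance; both readings, and the variable names, are assumptions made for the illustration.

```python
import math

n_total = 8047                        # determinations in the whole period
years = 16

se_total = math.sqrt(n_total / 72)    # standard error of each coefficient
se_year = se_total / years            # the same for one year
se_inflated = se_year * math.sqrt(4.3)

# Pooled uncertainty from the separate yearly determinations.
pooled = math.sqrt((1.2 ** 2 + 2.4 ** 2) / 2)
t2 = (6.9 ** 2 + 0.9 ** 2) / 1.9 ** 2
```

The results are about 10.6, 0.66, 1.37 and 1.9, and a $t^2$ near the 13.3 of the text. The factor 72 is what one would expect if each coefficient is formed as one-sixth of a sum over twelve equally spaced months of counts with Poisson-like variation, since the mean of $\cos^2$ is one half.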

The Schuster criterion, which is frequently used to test periodicity,
is really the $\chi^2$ test adapted to two degrees of freedom. It has,
however, often led to surprising results. C. G. Knott, for instance, worked
out periodicities in earthquakes corresponding to various periods near a
month or fortnight, some of which corresponded to some tidal effect
while others did not. The amplitudes found were about twice the
Schuster expectation in 7 cases out of 8.† Knott therefore expressed
doubt about their genuineness. For the annual period he found (pp.
114-16) the maximum in different regions in several different months,
with an excessive number in December and January, and thus just
opposite to the above results.

† Physics of Earthquake Phenomena, 1908, 130-6.

The present analysis is not altogether satisfactory, because the list
used has been subject to a certain amount of selection. Thus the
Japanese (Tango) earthquake of 1927 March 7 produced 1,071 aftershocks
from March 11 to June 8; of these 532 are given in the I.S.S.
in daily lists, but only one is treated in detail and contributes to the
above totals. Most of them were small. On the other hand, some
earthquakes such as the Tokyo earthquake produced long series of large
aftershocks, which have contributed greatly. There are possibilities
that some bias might arise in deciding which earthquakes to treat fully
and which just to mention. But there seems to be no obvious way in
which this could affect the instrumental records periodically, and the
interval between the earthquakes and the time when the solutions were
made for them has gone through all possible phases during the interval
used. Yet we still have two possible explanations. Primitive earthquakes
might be stimulated more readily in summer, or they might be
equally likely to occur at any time of the year and tend to produce
more aftershocks in summer. There is no strong theoretical reason
for either hypothesis. To test them it would be necessary to have a
means of identifying primitive shocks, for instance by using only
earthquakes from new epicentres. Within a single series of aftershocks, that
of the Tango earthquake, I have found no evidence for any failure of
independence or for periodicity, the data agreeing well with a simple law
of chance $dt/(t-a)$, where $a$ is a little earlier than the time of the main
shock.† If this is general the only relevant data to a periodicity would be
the times of the main shocks and the number of aftershocks in each case.

† Gerlands Beiträge z. Geophysik, 53, 1938, 111-39.

Many studies of earthquake frequency do not rest on the I.S.S., which
is a fairly complete catalogue of the strong and moderate earthquakes,
but on much less detailed lists. For instance, in a paper by S. Yamaguti,‡
which inspired me to undertake the work of 6.4, it was claimed that
there was an association between the region of an earthquake and that
of its predecessor, even when they were in widely different regions. His
list gave only 420 earthquakes for thirty-two years; the I.S.S. shows
that the actual number must have been about fifty times this. He was
therefore not dealing with successors at all; and in three of his eight
regions the excess of successors in the same region that aftershocks
must have produced is replaced by a deficiency, which is presumably
due to the incompleteness of the catalogue. Thus an incomplete catalogue
can lead to the failure to find a genuine effect; but if any human
bias enters into the selection it may easily introduce a spurious one.
For these two reasons, non-randomness and possible bias in cataloguing,
I have great doubts about the reality of most of the earthquake
periodicities that have been claimed. (Actual examination of the
relations between earthquakes in different regions apparently obtained
by Yamaguti disclosed no apparent departure from randomness,† and
the same applied to my rediscussion using the $\chi^2$ test after the excess
in the same region had been allowed for.)

‡ Bull. Earthquake Res. Inst., Tokyo, 11, 1933, 46-68.

6.23. Grouping. It has already been seen that the estimate of the
uncertainty of the location parameter in an estimation problem, where
the data have been grouped, is based on the standard deviation of the
observations without correction for grouping. The same applies, as
Fisher has again pointed out, to significance tests based on grouped
data. This follows at once from the formula 5.0 (10). For the chance of
getting $a$ in a given range, given $q$ and the fact that the data have been
grouped, will be given by taking (3) with the uncorrected standard
error; the range of possible variation of $\alpha$ on $q'$ will be got by
applying the grouping correction to the apparent range, thus, in the
standard problem of function fitting, replacing $s$ by $(s^2-h^2/12)^{1/2}$
in the outside factor, which will therefore be reduced in the ratio
$(1-h^2/12s^2)^{1/2}$; but this is trivial. The usual formulae should
therefore be used without correction for grouping. This agrees with
Fisher’s recommendation.
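
How trivial the reduction is can be seen from a few values. The sketch assumes that $h$ is the width of the grouping interval and that the ratio is $(1-h^2/12s^2)^{1/2}$ as stated; the function name is illustrative.

```python
import math


def grouping_ratio(s: float, h: float) -> float:
    """Ratio (1 - h^2 / (12 s^2))^(1/2) by which the outside factor is reduced."""
    return math.sqrt(1.0 - h ** 2 / (12.0 * s ** 2))
```

Even with a grouping interval as wide as the standard deviation, `grouping_ratio(1.0, 1.0)` is about 0.96, so that K is altered by only a few per cent.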

6.3. Partial and serial correlation. The conditions of intraclass
correlation merge into those of two still more complicated problems,
those of partial correlation and serial correlation. In partial correlation
an observation consists of the values of $k$ variables, $x_1,\ldots,x_k$,
whose joint probability density on some law is proportional to
$\exp(-\tfrac12 W)$, where $W$ is a positive definite quadratic function of
the $x_r$. The problem will be, from $m$ such sets of observations to
estimate the coefficients in $W$. In intraclass correlation we may regard
the $x_r$ as having independent probability distributions about a variable
$\alpha_1$, which itself has a normal probability distribution about
$\alpha$. Then

$$P(dx_1\ldots dx_k \mid \alpha,\sigma,\tau,H) \propto dx_1\ldots dx_k\int\exp\left\{-\frac{(\alpha_1-\alpha)^2}{2\tau^2}-\frac{\sum(x_r-\alpha_1)^2}{2\sigma^2}\right\}d\alpha_1.$$

Integration with regard to $\alpha_1$ gives a joint probability distribution
of the form considered in partial correlation. It will, however, be
symmetrical in the $x_r$, which is not true in general for partial
correlation.

The theory of intraclass correlation assumes that the observations 
fall into sets, different sets being independent. There is often some 
reason to suppose this, but often the data occur in a definite order, and 
adjacent members in the order may be closely correlated. The extreme 

■f F. J. W. Whipple, M.N.R.A.8, Oeophys. Suppl. 3, 1934, 233-8. 

j Proc. Comb. Phil. Soc. 32, 1936, 441-6. 



§6.3 SIGNIFICANCK TESTS: VARIOUS COMPLICATIONS 329 

case is where the observations refer to a continuous function. We might
for each integral n choose $x_n$ from a table of random numbers and then
interpolate to intermediate values by one of the standard rules for
numerical interpolation. The result is a continuous function and the
estimated correlation between pairs of values at interval 0.1 would be
nearly unity, though the original data are derived by a purely random
process. Yule pointed out that many astronomical phenomena (to 
which may be added many meteorological ones) can be imitated by the 
following model. Imagine a massive pendulum of long period, slightly 
damped, at which a number of boys discharge pea-shooters at irregular 
intervals. The result will be to set the pendulum swinging in approxi- 
mately its natural period T; but the motion will be jerky. If there is 
a long interval when there are no hits the pendulum may come nearly 
to rest again and afterwards be restarted in a phase with no relation 
to its original one. In this problem there is a true underlying periodicity, 
that of a free undisturbed pendulum. But it will be quite untrue that 
the motion will repeat itself at regular intervals; in fact if we perform 
a harmonic analysis using data over too long an interval the true period 
may fail to reveal itself at all owing to accidental reversal of phase. 
What we have in fact, if we make observations at regular intervals short
compared with the true period, is a strong positive correlation between
consecutive values, decreasing with increasing interval, becoming negative
at intervals from $\tfrac14 T$ to $\tfrac34 T$, and then positive again. At sufficiently
long intervals the correlation will not be significant.
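
The behaviour of the serial correlation just described can be illustrated numerically. The sketch below assumes, as is commonly done but is not stated in the text, that the disturbed pendulum observed at equal intervals can be represented by a second-order autoregression; the period of 40 steps and the damping factor 0.98 per step are hypothetical values chosen only for illustration. It computes the theoretical serial correlations from the Yule-Walker recursion, without any random simulation.

```python
import math

def ar2_autocorrelation(period, damping, max_lag):
    """Serial correlations of x_t = a*x_{t-1} + b*x_{t-2} + shock_t.

    `period` (in observation steps) and `damping` (amplitude factor per
    step, slightly below 1) are illustrative assumptions.
    """
    theta = 2.0 * math.pi / period
    a = 2.0 * damping * math.cos(theta)
    b = -damping ** 2
    rho = [1.0, a / (1.0 - b)]            # Yule-Walker starting values
    for _ in range(2, max_lag + 1):
        rho.append(a * rho[-1] + b * rho[-2])
    return rho

rho = ar2_autocorrelation(period=40, damping=0.98, max_lag=400)
for lag in (1, 5, 10, 15, 20, 25, 30, 40, 400):
    print(lag, round(rho[lag], 3))
```

With these values the correlation is close to 1 at lag 1, is negative over roughly the middle half of the period, is positive again near a full period, and is negligible at lag 400.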

In such a problem each value is highly relevant to the adjacent values, 
but supplementary information relative to any value can be found from 
others not adjacent to it, the importance of the additional information 
tending to zero when the interval becomes large. For a free pendulum, 
for instance, the displacement at one instant would be a linear function
of those at the two preceding instants of observation; but if the error 
of observation is appreciable three adjacent observations would give a 
very bad determination of the period. To get the best determination 
from the data it will be necessary to compare observations at least a 
half-period apart, and it becomes a problem of great importance to 
decide on the best method of estimation. Much work is being done on 
such problems at present, though it has not yet led to a generally 
satisfactory theory.†

† Cf. M. G. Kendall, Contributions to the Study of Oscillatory Time-series, 1946.

A simple rule for the invariant J can be found in a large class of cases
where (1) the probability of any one observation by itself is the same
for both laws, (2) the probability of one observation, given the law and
the previous observations, depends only on the immediately preceding
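one. We have for the whole series,

$$J = \sum \sum_r \log\frac{dP(x_r \mid x_1 \dots x_{r-1}, \alpha', H)}{dP(x_r \mid x_1 \dots x_{r-1}, \alpha, H)} \{P(x_1 \mid \alpha' H)P(x_2 \mid x_1 \alpha' H)\dots P(x_n \mid x_1 \dots x_{n-1} \alpha' H) - P(x_1 \mid \alpha H)P(x_2 \mid x_1 \alpha H)\dots P(x_n \mid x_1 \dots x_{n-1} \alpha H)\}.$$

The terms containing $\log dP(x_r \mid \dots)$ reduce in the conditions stated to

$$\sum \log\frac{dP(x_r \mid x_{r-1}, \alpha', H)}{dP(x_r \mid x_{r-1}, \alpha, H)} \{P(x_1 \mid \alpha' H)\dots P(x_r \mid x_1 \dots x_{r-1} \alpha' H) - P(x_1 \mid \alpha H)\dots P(x_r \mid x_1 \dots x_{r-1} \alpha H)\},$$

since the later factors in the products, summed over the later values,
add up to 1; and we can also sum over $x_s$ for
$s < r-1$. We can now sum over these and get

$$\sum \log\frac{dP(x_r \mid x_{r-1}, \alpha', H)}{dP(x_r \mid x_{r-1}, \alpha, H)} \{P(x_{r-1} \mid \alpha' H)P(x_r \mid x_{r-1}, \alpha', H) - P(x_{r-1} \mid \alpha H)P(x_r \mid x_{r-1}, \alpha, H)\}.$$

By condition (1), $P(x_{r-1} \mid \alpha' H) = P(x_{r-1} \mid \alpha H)$,
and therefore this term reduces to

$$\sum_{x_{r-1}} P(x_{r-1} \mid \alpha H)\, J_r,$$

where

$$J_r = \sum_{x_r} \log\frac{dP(x_r \mid x_{r-1}, \alpha', H)}{dP(x_r \mid x_{r-1}, \alpha, H)} \{P(x_r \mid x_{r-1}, \alpha', H) - P(x_r \mid x_{r-1}, \alpha, H)\}.$$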


We have to sum over the possible values of $x_{r-1}$, and then with regard
to r. Finally, dividing by n as indicated on p. 170, we have a summary
value of J which can be used as in the case of independent observations.

For $r = 1$, $J_1 = 0$; for $r > 1$, $J_r$ is simply J for the comparison of the
two laws with $x_{r-1}$ among the data.

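The simplest case of this type is where each observation is a measure
and the relation between consecutive measures is of the form

$$x_r = \rho x_{r-1} + \epsilon_r,$$

where $\epsilon_r$ is a random part with standard error τ and all $x_r$, taken
separately, have normal probability distributions
about 0 with standard error σ. Then

$$\tau = \sigma(1-\rho^2)^{1/2},$$

and for different values of ρ, with σ fixed, $J_r$ is the same as for comparison
of two normal laws with true values $\rho x_{r-1}$, $\rho' x_{r-1}$ and standard
errors $\sigma(1-\rho^2)^{1/2}$, $\sigma(1-\rho'^2)^{1/2}$. Then $J_r$ ($r > 1$) follows from 3.9 (15):

$$J_r = \frac{(\rho'^2-\rho^2)^2}{2(1-\rho^2)(1-\rho'^2)} + \frac{(\rho'-\rho)^2 x_{r-1}^2}{2\sigma^2}\left\{\frac{1}{1-\rho^2} + \frac{1}{1-\rho'^2}\right\}.$$

But the expectation of $x_{r-1}^2$ is $\sigma^2$. Hence

$$\sum J_r = (n-1)\frac{(\rho'-\rho)^2(1+\rho\rho')}{(1-\rho^2)(1-\rho'^2)}, \qquad J = \left(1-\frac{1}{n}\right)\frac{(\rho'-\rho)^2(1+\rho\rho')}{(1-\rho^2)(1-\rho'^2)}.$$

This is identical with J for the comparison of two correlations, the
standard errors being given.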

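The joint likelihood for n observations is

$$\frac{1}{(2\pi)^{n/2}\sigma^n(1-\rho^2)^{(n-1)/2}} \exp\left[-\frac{x_1^2}{2\sigma^2} - \frac{1}{2\sigma^2(1-\rho^2)}\sum_{r=2}^{n}(x_r-\rho x_{r-1})^2\right]$$

$$= \frac{1}{(2\pi)^{n/2}\sigma^n(1-\rho^2)^{(n-1)/2}} \exp\left[-\frac{1}{2\sigma^2(1-\rho^2)}\{x_1^2 - 2\rho x_1 x_2 + (1+\rho^2)x_2^2 - \dots + x_n^2\}\right].$$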
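
As a check on the algebra, the two forms of the joint likelihood can be compared numerically. The sketch below assumes the reading of the formula in which the first observation is normal about 0 with standard error σ and each later one is normal about ρ times its predecessor with standard error $\sigma(1-\rho^2)^{1/2}$; the function names and the sample values are hypothetical.

```python
import math

def loglik_sequential(x, rho, sigma):
    """Log-likelihood built up one observation at a time."""
    tau2 = sigma ** 2 * (1.0 - rho ** 2)      # variance given the predecessor
    ll = -0.5 * math.log(2 * math.pi * sigma ** 2) - x[0] ** 2 / (2 * sigma ** 2)
    for prev, cur in zip(x, x[1:]):
        ll += -0.5 * math.log(2 * math.pi * tau2) - (cur - rho * prev) ** 2 / (2 * tau2)
    return ll

def loglik_quadratic(x, rho, sigma):
    """Same quantity from the single quadratic form in the exponent."""
    n = len(x)
    interior = sum(v * v for v in x[1:-1])    # x_2 ... x_{n-1}
    cross = sum(p * c for p, c in zip(x, x[1:]))
    q = x[0] ** 2 + x[-1] ** 2 + (1 + rho ** 2) * interior - 2 * rho * cross
    norm = (-0.5 * n * math.log(2 * math.pi) - n * math.log(sigma)
            - 0.5 * (n - 1) * math.log(1 - rho ** 2))
    return norm - q / (2 * sigma ** 2 * (1 - rho ** 2))

sample = [0.3, -1.2, 0.8, 0.1, -0.4]          # arbitrary illustrative data
difference = loglik_sequential(sample, 0.6, 1.3) - loglik_quadratic(sample, 0.6, 1.3)
print(abs(difference) < 1e-9)
```

The first and last squares enter with coefficient 1 and the interior squares with coefficient $1+\rho^2$, which is the asymmetry referred to in the discussion of sufficient statistics.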


The interesting mathematical properties of J in this problem suggest
that it might be used, but there are obvious difficulties. One is similar
to what we have twice noticed already. If the suggested value of ρ is 1,
and ρ' has any value other than 1, J is infinite, and the test fails. The
estimation rule gives a singularity not only at ρ = 1, which might be
tolerable, but also at −1, which is not. If the correlation is ρ, for values
of a function taken at equal intervals, say 1, we might try to estimate
ρ from observations at intervals 2. The correlation at interval 2 would
be ρ². The same method would apply, but J would be seriously changed
if we replaced ρ by ρ² in it.
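
The two difficulties can be seen numerically. The sketch below takes the value of J per step to be $(\rho'-\rho)^2(1+\rho\rho')/\{(1-\rho^2)(1-\rho'^2)\}$, which is what the comparison of two normal laws with true values $\rho x_{r-1}$, $\rho' x_{r-1}$ and standard errors $\sigma(1-\rho^2)^{1/2}$, $\sigma(1-\rho'^2)^{1/2}$ gives when averaged over $x_{r-1}$; the function names and the trial values 0.5 and 0.6 are hypothetical.

```python
def serial_J(rho, rho_alt):
    """Per-step J for two serial correlations (closed form assumed above)."""
    return ((rho_alt - rho) ** 2 * (1 + rho * rho_alt)
            / ((1 - rho ** 2) * (1 - rho_alt ** 2)))

def serial_J_from_normals(rho, rho_alt):
    """Same quantity from the two-normal-law formula, with E[x^2] = sigma^2."""
    a, b = 1 - rho ** 2, 1 - rho_alt ** 2     # variance ratios to sigma^2
    return 0.5 * (a / b + b / a - 2) + 0.5 * (rho_alt - rho) ** 2 * (1 / a + 1 / b)

j_interval_1 = serial_J(0.5, 0.6)
j_interval_2 = serial_J(0.5 ** 2, 0.6 ** 2)   # same laws seen at interval 2
j_near_one = serial_J(0.5, 0.999999)

print(j_interval_1, j_interval_2, j_near_one)
```

The value at interval 2 differs markedly from that at interval 1 for the same pair of laws, and J increases without limit as either correlation approaches 1 or −1.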




On account of the asymmetry for the first and last observations there
are no sufficient statistics, but a nearly sufficient pair will be

$$\sum_{r=1}^{n-1} x_r x_{r+1}, \qquad \tfrac12(x_1^2 + x_n^2) + \sum_{r=2}^{n-1} x_r^2 .$$

This problem is given only as an illustration. In actual cases the corre- 
lation will usually run over several observations, effectively an infinite 
number for a continuous function, and the procedure becomes much 
more complicated. Further, the law itself may differ greatly from 
normality. I have had two cases of this myself where the problem was to 
estimate a predicted nearly periodic variation and the observations were 
affected by non-normal errors with a serial correlation between them.†
A completely systematic procedure was impossible in the present state 
of knowledge, but approximate methods were devised that appeared 
fairly satisfactory in the actual problems considered. 

My impression is that, though the use of J gives rules for the prior 
probability in many cases where they have hitherto had to be guessed, it 
is not of universal application. It is sufficiently successful to encourage
us to hope for a general invariance rule, but not successful enough to 
make us think that we have yet found it. I think that the analysis 
of partial correlation should lead to something more satisfactory. 

In problems of continuous variation with a random element the 
ultimate trouble is that we have not yet succeeded in stating the law 
properly. The most hopeful suggestion hitherto seems to be Sir G. I. 
Taylor’s theory of diffusion by continuous movements,‡ which has been
extensively used in the theory of turbulence. At least, by taking corre- 
lations between values of a variable at any time-interval, it avoids the 
need to consider a special time-interval as fundamental. 

6.4. Contingency affecting only diagonal elements. In the simple 
2x2 contingency table we have a clear-cut test for the association of 
two ungraduated properties. In normal correlation we have a case 
where each property is measurable and the question is whether the 
parameter p is zero or not, and to provide an estimate of it if it is not. 
Rank correlation is an extension to the case where the properties are 
not necessarily measurable, but each can be arranged in a sequence of 
increasing intensity, and the question is whether they tend to be 
specially associated near one line in the diagram, usually near a diagonal 

† M.N.R.A.S. 100, 1940, 139-55; 102, 1942, 194-204.

‡ Proc. Lond. Math. Soc. (2) 20, 1922, 196-212.





of the table. The amounts of the displacements from this line are 
relevant to the question. A more extreme case is where, on the hypo- 
thesis q', only diagonal elements would be affected. The distinction 
from the case of rank correlation may be illustrated by a case where the 
two orders are as follows: 

    X    Y    X−Y
    1    2    −1
    2    1    +1
    3    4    −1
    4    3    +1
    5    6    −1
    6    5    +1
    7    8    −1
    8    7    +1

The rank correlation is 1 − 48/504 = 0.905. Yet not a single member

occupies the same place in the two orders. We can assert a close general 
correspondence without there being absolute identity anywhere. But 
there are cases where only absolute identity is relevant to the question 
under test. Such a case has been discussed by W. L. Stevens,† namely
that of the alleged telepathic recognition of cards. Evidence for the 
phenomenon would rest entirely on an excess number of cases where the 
presentation and identification refer to the same card; if the card pre- 
sented is the king of spades, the subject is equally wrong whether 
he identifies it as the king of clubs, the queen of spades, or the two of 
diamonds. (I am not sure whether this is right, but it is part of the 
conditions of the problem.) Another case is the tendency of an earthquake
in a region to be followed by another in the same region; to test
such a tendency we cannot use rank correlation because the regions
cannot be arranged in a single order. The known phenomenon is that
a large earthquake is often followed by a number of others in the same
neighbourhood; but to test whether this is an accidental association or
not we must regard any pair not in the same region as unconnected, 
whether the separation is 2,000 or 20,000 km. Only successors in the 
same region are favourable to the suggested association, and we have 
to test whether the excess of successors in the same region is large 
enough to support the suggestion that one earthquake tends to stimu- 
late another soon after and at a small distance. 
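
The contrast drawn above between rank correlation and exact identity can be checked directly for the two orders tabulated. The sketch below uses the usual formula $1 - 6\sum d^2/\{n(n^2-1)\}$ for the rank correlation; the variable names are illustrative only.

```python
def rank_correlation(x_ranks, y_ranks):
    """Rank correlation 1 - 6*sum(d^2) / (n*(n^2 - 1)) for untied ranks."""
    n = len(x_ranks)
    sum_d2 = sum((x - y) ** 2 for x, y in zip(x_ranks, y_ranks))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

x_order = [1, 2, 3, 4, 5, 6, 7, 8]
y_order = [2, 1, 4, 3, 6, 5, 8, 7]

rho_rank = rank_correlation(x_order, y_order)
exact_matches = sum(1 for x, y in zip(x_order, y_order) if x == y)

print(round(rho_rank, 3), exact_matches)
```

The rank correlation is high although the number of exact coincidences, which is the only quantity relevant to the card and earthquake problems, is zero.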

In the earthquake problem, which may be representative of a large 

† Ann. Eugen. 8, 1938, 238-44.





number of others, given that the last earthquake was in a particular
region, the probability that the next will be in that region and stimulated
by it is α, which we may take to be the same for all earthquakes.
On hypothesis q, α will be 0. The chance at any time that the next
earthquake will be in the rth region is $p_r$. On the hypothesis of randomness
the chance that the next will be in region r and the next but one
in region s will be $p_r p_s$, where all the p’s will have to be found from the
data. On hypothesis q', the chance that an earthquake will be in region
r and followed by one stimulated by it will be $p_r\alpha$, leaving $p_r(1-\alpha)$ to
be distributed in proportion to the $p_s$ (including s = r since we are not
considering on q' that the occurrence of an earthquake in a region precludes
the possibility that the next will be an independent one in the
same region). Thus the joint chance will be $(1-\alpha)p_r p_s$, except for s = r,
for which it is $(1-\alpha)p_r^2 + \alpha p_r$. Proceeding to the third and neglecting
any influence of an earthquake other than its immediate predecessor,
the joint chance of all three will be obtained by multiplying these expressions
by $(1-\alpha)p_t$ if $t \ne s$, and by $(1-\alpha)p_s + \alpha$ if $t = s$. So we may
proceed. The joint chance of a set of earthquakes, in a particular order,
such that in $x_{rs}$ cases an earthquake in region r is followed by one in
region s, for all values of r and s, is



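$$(1-\alpha)^{N} \prod_r p_r^{x_r} \prod_r \left\{1 + \frac{\alpha}{(1-\alpha)p_r}\right\}^{x_{rr}}, \qquad (1)$$

where

$$x_r = \sum_s x_{rs}, \qquad N = \sum_r x_r, \qquad (2)$$

and the last factor is the product over all repetitions. Then this is
$P(\theta \mid p_r, \alpha, H)$. $P(\theta \mid q, p_r, H)$ is got by putting α = 0.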

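The invariant J for comparison of q and q' can be found by the method
of 6.3. We have, if the (m − 1)th observation is in region r,

$$J = \sum_r p_r \log\left\{\frac{(1-\alpha)p_r + \alpha}{p_r}\right\}\{(1-\alpha)p_r + \alpha - p_r\} + \sum_r \sum_{s \ne r} p_r \log(1-\alpha)\{(1-\alpha)p_s - p_s\}$$

$$= \sum_r p_r \log\left(1 - \alpha + \frac{\alpha}{p_r}\right)\alpha(1-p_r) - \sum_r p_r \log(1-\alpha)\,.\,\alpha(1-p_r),$$

the sums over r running from 1 to m,
where m is the number of regions. J is infinite if α = 1, corresponding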





to the case where, if an earthquake is in a given region, the next is 
certain to be in that region. J is also infinite if for some r,

$$(1-\alpha)p_r + \alpha = 0,$$

corresponding to the case where α is negative and sufficiently large
numerically for the occurrence of an earthquake in some region to
inhibit the occurrence of the next in that region. This might conceivably
be true, since we could contemplate a state of affairs where an earthquake
relieves all stress in the region and no further earthquake can occur until
the stresses have had time to grow again; by which time there will almost
certainly have been an earthquake somewhere else. It is therefore worth
while to consider the possibility of negative α. For a significance test,
however, it is enough to have an approximation for α small and we shall
take

$$P(d\alpha \mid p_1 \dots p_m H) = \frac{1}{\pi}\sqrt{(m-1)}\, d\alpha .$$

The interpretation of the factor in m. is that our way of stating the 
problem does not distinguish between different parts of a region. An 
earthquake in it may stimulate one in another part of the region, which 
will be reckoned as in a different region if the region is subdivided, and 
hence subdivision will increase the concentration of the probability of 
α towards smaller values.

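The solution is now found as usual; the factors depending on $p_r$ are
nearly the same in both $P(q \mid \theta H)$ and $P(q' \mid \theta H)$, and we can substitute
the approximate values

$$p_r = x_r/N$$

in the factors that also involve α. Then

$$\frac{1}{K} = \frac{\sqrt{(m-1)}}{\pi} \int (1-\alpha)^N \prod_r \left\{1 + \frac{\alpha}{(1-\alpha)p_r}\right\}^{x_{rr}} d\alpha ;$$

we put $x_{rr} = x_r(p_r + a_r)$
and expand the logarithm of the integrand to order $\alpha^2$ and $\alpha a_r$. We find
after reduction

$$\frac{1}{K} = \frac{\sqrt{(m-1)}}{\pi} \int \exp\{N\alpha \sum a_r - \tfrac12(m-1)N\alpha^2\}\, d\alpha$$

$$= \frac{\sqrt{(m-1)}}{\pi} \int \exp\{-\tfrac12(m-1)N(\alpha-a)^2 + \tfrac12(m-1)Na^2\}\, d\alpha,$$

where $(m-1)a = \sum a_r$. Hence

$$K = \left(\frac{\pi N}{2}\right)^{1/2} \exp\{-\tfrac12(m-1)Na^2\}.$$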





If K is small we shall have

$$\alpha = a \pm \frac{1}{\sqrt{\{(m-1)N\}}} .$$

The following table was compiled from the International Seismological
Summary from July 1926 to December 1930. The earthquakes
used were divided into ten regions; eight earthquakes in Africa were
ignored because they were too few to be of any use. In some cases,
also, several widely different epicentres would fit the few observations
available, and these also were ignored. Thus the table is limited to
fairly well observed earthquakes, which are only a fraction of those
that actually occur. The North Pacific in west longitude was included
with North America; the Eastern North Pacific was divided between
Japan (with the Loo-Choo Islands and Formosa) and the Philippines;
the East Indies were included with the South Pacific; the West Indies
with Central America; and the Mediterranean region and the north
coast of Africa with Europe. The results are as follows:


                                               Second
First              Europe   Asia  Ind. O.  Japan  Philip.  S. Pac.  N. Am.  C. Am.  S. Am.  Atlantic   Total     a_r
Europe                 97     58      11     73      12       60      22      22      23        19     397   +0.092
Asia                   69    119      13     93      21       56      16      20      22        15     444   +0.098
Indian Ocean           10     17       8     23       4       10       5       3       6         2      88   +0.057
Japan                  84     90      21    179      22       82      24      36      26        26     590   +0.077
Philippines             8     18       4     31      33       22       5       6       8         4     139   +0.184
South Pacific          57     62      14     81      17      115      22      16      22        19     425   +0.107
North America          17     18       3     32       6       18      21       6       6         5     132   +0.108
Central America        16     28       4     26       5       22       2      16      10         2     131   +0.072
South America          29     19       4     33       9       27       7       4      24         1     157   +0.092
Atlantic               10     15       6     19      10       13       8       2      10         8     101   +0.041
                                                                                                       2604   +0.928


Here m = 10, N = 2604, $\sum a_r$ = 0.928. Then

$$K = 1.6 \times 10^{-\ldots}.$$


The evidence for q' is therefore overwhelming. The estimate of α is

$$\alpha = +0.1031 \pm 0.0065 .$$

This can be interpreted as the chance that a given earthquake will be 
followed by an aftershock, strong enough to be widely recorded, before 
there has been another widely recorded earthquake anywhere else. 
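
The estimate can be reproduced from the diagonal entries and row totals of the table. The sketch below assumes the approximate solution in which $a_r = x_{rr}/x_r - x_r/N$, the estimate of α is $\sum a_r/(m-1)$ and its standard error is $\{(m-1)N\}^{-1/2}$; the South Pacific diagonal entry is taken as 115 and the South America total as 157, these being the readings consistent with N = 2604 and $\sum a_r$ = 0.928. The value of K from the quadratic approximation is printed only as an order of magnitude, since the expansion is rough for the smaller regions.

```python
import math

# Diagonal entries x_rr and row totals x_r, in the order of the table.
diagonal = [97, 119, 8, 179, 33, 115, 21, 16, 24, 8]
row_totals = [397, 444, 88, 590, 139, 425, 132, 131, 157, 101]

m = len(diagonal)                       # number of regions
N = sum(row_totals)                     # number of consecutive pairs

# Excess of the observed repetition rate over the random expectation.
a = [d / t - t / N for d, t in zip(diagonal, row_totals)]
sum_a = sum(a)

alpha_hat = sum_a / (m - 1)
std_err = 1.0 / math.sqrt((m - 1) * N)

# log10 of K = (pi*N/2)^(1/2) * exp(-(m-1)*N*alpha_hat^2 / 2)
log10_K = (0.5 * math.log10(math.pi * N / 2)
           - 0.5 * (m - 1) * N * alpha_hat ** 2 / math.log(10))

print(N, round(sum_a, 3), round(alpha_hat, 4), round(std_err, 4))
print("log10 K is about", round(log10_K, 1))
```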

6.5. Deduction as an approximation. We have seen that in 
significance tests enormous odds are often obtained against the null 













hypothesis, but that those obtained for it are usually much smaller. 
A large discrepancy makes K exponentially small, but even exact agree- 
ment with the predictions made by the null hypothesis only makes K 
of order $n^{1/2}$. But a small K does not establish the hypothesis q'. It
only shows that the hypothesis that one new parameter is needed, the 
rest of the variation being regarded as random, is more probable than 
that the whole variation is random. It does not say that no further 
parameter is still needed. Before we can actually attach a high proba- 
bility to q' in its present form we must treat it as a new q and test 
possible departures from it; and it is only if it survives these tests that 
it can be used for prediction. Thus when a hypothesis comes to be 
actually used, on the ground that it is ‘supported by the observations’, 
the probability that it is false is always of order $n^{-1/2}$, which may be as
large as 0.2 and will hardly ever be as small as 0.001. Strictly, therefore,
any inferences that we draw from the data should not be the inferences
from q alone but from q together with all the alternatives that have 
been considered but found not to be supported by the data, with 
allowance for their posterior probabilities. If, for instance, x denotes 
the proposition that some future observation will lie in a particular 
range, and we consider a set of alternative hypotheses $q_1, q_2, \dots$, we shall
have

$$P(x \mid \theta H) = \sum_m P(x \mid q_m \theta H)\, P(q_m \mid \theta H).$$

Now if in a given case one of the hypotheses, q say, has a high probability
on the data, and all the others correspondingly small ones,
$P(x \mid \theta H)$ will be high if x has a high probability on q. If x has a low
probability on q, its probability will be composed of the small part
from q, representing the tail of the distribution of the chance on q, and
of the various contributions from the other $q_m$. But the last together
make up q', and the total probability of all such values cannot exceed
the posterior probability of q'. Thus the total posterior probability
that the observation will be in a range improbable on q will be small.
In our case the situation is more extreme, for the $q_m$ will be statements
of possible values of a parameter α, which we may take to be 0 on q.
But when K is large nearly all the total probability of q' comes from
values of α near the maximum likelihood solution, which itself is small
and will give therefore almost the same inferences as q. The only effect
of q' is to add to the distribution on q another about nearly the same
maximum and with a slightly larger scatter and a smaller total area.
Thus the total distribution on data θH is practically the same as on qθH
alone; the statement of θ takes care of the uncertainties on the data of





the parameters that are relevant on q. Thus if q has been found to be
supported by the data we can take as a good approximation

$$P(x \mid \theta H) = P(x \mid q\theta H),$$

thus virtually asserting q and neglecting the alternatives. We have in
fact reached an instance of the theorem of 1.6, that a well-verified
hypothesis will probably continue to lead to correct inferences even if
it is wrong. The only alternatives not excluded by the data are those
that lead to almost the same inferences as the one adopted. The
difference from the inferences in a simple estimation problem is that
the bulk of the probability distribution of α is concentrated in α = 0
instead of being about the maximum likelihood solution.
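
The size of the error committed by asserting q outright can be illustrated with a small numerical sketch. All the numbers below are hypothetical: a posterior probability 0.98 for q (α = 0), the remaining 0.02 spread over three small values of α, a unit normal law of error, and the proposition x that a future observation exceeds 2. The full probability is formed by summing over the hypotheses weighted by their posterior probabilities and compared with the value on q alone.

```python
import math

def normal_tail(x, mean=0.0, sd=1.0):
    """P(X > x) for a normal law with the given mean and standard error."""
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2.0)))

post_q = 0.98                                        # hypothetical P(q | data)
alternatives = {0.05: 0.008, 0.10: 0.008, 0.15: 0.004}   # alpha: posterior prob.

threshold = 2.0
p_on_q = normal_tail(threshold, mean=0.0)
p_full = post_q * p_on_q + sum(
    weight * normal_tail(threshold, mean=alpha)
    for alpha, weight in alternatives.items()
)

print(round(p_on_q, 5), round(p_full, 5), round(abs(p_full - p_on_q), 5))
```

The difference cannot exceed the posterior probability of q', and here it is far smaller, because the surviving alternatives lie close to α = 0.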

This approximation means an enormous practical convenience. In 
theory we never dispose completely of q', and to be exact we should 
allow for the contributions of all non-zero values of α in all future
inferences. This would be hopelessly inconvenient, and indeed there 
is a limit to the amount of calculation that can be undertaken at all — 
another imperfection of the human mind. But it turns out that we 
need not do so; if K has been greater than 1 for all suggested modifications
of q we can proceed as if q was true. At this stage science becomes
deductive. This, however, is not a virtue, and it has nothing to do 
with pure logic. It is merely that deduction has at last found its 
proper place, as a convenient approximation to induction. However, 
at this stage all parameters in q now acquire a permanent status (at any 
rate until further observation shows, if ever, that q was wrong after 
all). Planetary theory, for instance, involves associating with each 
planet a certain quantity, which remains unchanged in predicting all 
observations. It is convenient to give this a definite name, mass. This 
process occurs at a much more elementary stage of learning. Whenever 
we find a set of properties so generally associated that we can infer that 
they will probably be associated in future instances, we can assert their 
general association as an approximate rule, and it becomes worth while 
to form the concept of things with this set of properties and give them 
a name. For scientific purposes reality means just this. It is not an 
a priori notion, and does not imply philosophical reality, whatever that 
may mean. It is simply a practical rule of method that becomes con- 
venient when we can replace an inductive inference approximately by 
a deductive one. The possibility of doing it in any particular case is 
based on experience. Thus deduction is to be used in a rather Pick- 
wickian sense. It no longer claims to make inferences with certainty, 





for three reasons. The law used may be wrong; even if right, it contains 
parameters with finite uncertainties on the data, and these contribute 
to the uncertainty of predictions; and the prediction itself is made with 
a margin of uncertainty, expressing the random error of the individual 
observation. 

It is worth while to devote some attention to considering how a law,
once well supported, can be wrong. A new parameter rejected by a 
significance test need not in fact be zero. All that we say is that on the 
data there is a high probability that it is. But it is perfectly possible 
that it is not zero but too small to have been detected with the accuracy 
yet attained. We have seen how such small deviations from a law may
be detected by a large sample when they would appear to have been
denied by any sub-sample less than a certain size, and that this is not
a contradiction of our general rules. But the question is whether we can
allow for it by extending the meaning of q so as to say that the new
parameter is not 0 but may be anywhere in some finite range. This 
might guard against a certain number of inferences stated with an 
accuracy that further work shows not to be realized. I think, however, 
that it is both impossible and undesirable. It is impossible because q 
could not then be stated; it would need to give the actual limits of the 
range, and these by hypothesis are unknown. Such limits would be a 
sheer guess and merely introduce an arbitrariness. Further, as the 
number of observations increases, the accuracy of an estimate also 
increases, and we cannot say in advance what limit, if any, it can reach.
Hence if we suggest any limit on q it is possible that with enough 
observations we shall get an estimate on q' that makes nearly the whole 
chance of α lie within those limits. What should we do then? K would
be in the ratio of the ranges permitted on q' and q. Should we be satis- 
fied to take the solution as it stands, or should we set up a new q that 
nobody has heard of before with a smaller range ? I think that the latter 
alternative is the one any scientist would adopt. The former would say 
that the estimate must be accepted whether we adopt q or q'. But it 
is just then that we should think that the reason we have got a doubtful 
value within the range on q is that we took the range too large in the 
first place ; and the only way of guarding against such a contradiction 
is to take the range on q zero. If there is anything to suggest a range 
of possible values it should go into the statement of q', not of q. 

Possible mistakes arising from parameters already considered and 
rejected being in fact not zero, but small compared with the critical 
value, can then be corrected in due course when enough information 




becomes available. If we try to guard against it in advance we are 
not giving the inference from the data available, but simply guessing. 
If K > 1, then on the data the parameter probably is zero; there is no
intelligible alternative. It does not help in the least to find out that a
parameter is 0.1 if we say that it may not be 0 when the estimate is
0.5 ± 0.5. All that we can say is that we cannot find out that it is not 0
until we have increased our accuracy, and this is said with sufficient 
emphasis by making the posterior probability of q high but not 1 . 

A new parameter may be conspicuous without being very highly 
significant, or vice versa. A 5 to 0 sample appears striking evidence at 
first sight, but it only gives odds of 16 to 3 against an even chance. The 
bias in Weldon’s dice experiments is hardly noticeable on inspection, 
but gives odds of about 1,600 to 1. With a small number of observa- 
tions we can never get a very decisive result in sampling problems, and 
seldom get one in measurement. But with a large number we usually 
get one one way or the other. This is a reason for taking many observa- 
tions. But the question may arise whether anomalies that need so 
many observations to reveal them are worth taking into account anyhow.
In Weldon’s experiments the excess chance is only 0.0044, and
would be less than the standard error if the number of throws in a 
future trial is less than about 10,000. So if we propose to throw dice 
fewer times than this we shall gain little by taking the bias into account. 
Still, many important phenomena have been revealed by just this sort 
of analysis of numerous observations, such as the variation of latitude 
and many small parallaxes in astronomy. The success of Newton was 
not that he explained all the variation of the observed positions of the 
planets, but that he explained most of it. The same applies to a great 
part of modern experimental physics. Where a variation is almost 
wholly accounted for by a new function, and the observations are
reasonably numerous, it is obvious on inspection and would also pass 
any significance test by an enormous margin. This is why so many 
great advances have been made without much attention to statistical 
theory on the part of their makers. But when we come to deal with 
smaller effects an accurate analysis becomes necessary. 
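
The two numerical statements in the last paragraph can be checked with a short calculation. The sketch assumes the test of an even chance in which the alternative gives the chance a uniform prior, so that $K = (n+1)!/(x!\,y!\,2^n)$ for a sample of x and y, and it assumes that the chance concerned in Weldon's experiments is close to 1/3; both assumptions come from outside this passage, and the function name is hypothetical.

```python
from fractions import Fraction
from math import factorial

def even_chance_K(x, y):
    """K for an even chance against a uniformly distributed chance (assumed form)."""
    n = x + y
    return Fraction(factorial(n + 1), factorial(x) * factorial(y)) / 2 ** n

K_five_nil = even_chance_K(5, 0)
odds_against_even_chance = 1 / K_five_nil

# Number of throws at which the standard error of an observed proportion
# equals the excess chance 0.0044, taking the chance itself as about 1/3.
chance = 1.0 / 3.0
excess = 0.0044
throws_needed = chance * (1.0 - chance) / excess ** 2

print(K_five_nil, odds_against_even_chance, round(throws_needed))
```

The first result gives odds of 16 to 3, and the second a number of throws of the order of 10,000, in agreement with the figures quoted.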



VII 


FREQUENCY DEFINITIONS AND DIRECT METHODS 

Lord Mansfield gave the following advice to the newly-appointed Governor
of a West India Island. ‘There is no difficulty in deciding a case — only hear 
both sides patiently, then consider what you think justice requires, and decide 
accordingly; but never give reasons, for your judgment will probably be 
right, but your reasons will certainly be wrong.’ 

A. H. Engelbach, More Anecdotes of Bench and Bar. 

7.0. Most of current statistical theory, as it is stated, is made to appear 
to depend on one or other of various definitions of probability that 
claim to avoid the notion of degrees of reasonable belief. Their object
is to reduce the number of postulates, a very laudable aim; if this 
notion could be avoided our first axiom would be unnecessary. My 
contention is that this axiom is necessary, and that in practice no 
statistician ever uses a frequency definition, but that all use the notion 
of degree of reasonable belief, usually without even noticing that they 
are using it and that by using it they are contradicting the principles 
they have laid down at the outset. I do not offer this as a criticism
of their results. Their practice, when they come to specific applications, 
is mostly very good; the fault is in the precepts. 

7.01. Three definitions have been attempted: 

1. If there are n possible alternatives, for m of which p is true, then
the probability of p is defined to be m/n.

2. If an event occurs a large number of times, then the probability
of p is the limit of the ratio of the number of times when p will be
true to the whole number of trials, when the number of trials tends
to infinity.

3. An actually infinite number of possible trials is assumed. Then
the probability of p is defined as the ratio of the number of cases where
p is true to the whole number.

The first definition is sometimes called the ‘classical’ one, and is 
stated in much modern work, notably that of J. Neyman.† The second
is the Venn limit, its chief modern exponent being R. Mises.‡ The
third is the ‘hypothetical infinite population’, and is usually associated 
with the name of Fisher, though it occurred earlier in statistical 
mechanics in the writings of Willard Gibbs, whose ‘ensemble’ still plays 

† Phil. Trans. A, 236, 1937, 333-80.

‡ Wahrscheinlichkeit, Statistik und Wahrheit, 1928; Wahrscheinlichkeitsrechnung, 1931.




a ghostly part. The three definitions are sometimes assumed to be 
equivalent, but this is certainly untrue in the mathematical sense. 

7.02. The first definition appears at the beginning of De Moivre’s 
book.† It often gives a definite value to a probability; the trouble is
that the value is often one that its user immediately rejects. Thus sup- 
pose that we are considering two boxes, one containing one white and 
one black ball, and the other one white and two black. A box is to be 
selected at random and then a ball at random from that box. What 
is the probability that the ball will be white ? There are five balls, two 
of which are white. Therefore, according to the definition, the probability
is $\tfrac25$. But most statistical writers, including, I think, most of
those that professedly accept the definition, would give $\tfrac12 \cdot \tfrac12 + \tfrac12 \cdot \tfrac13 = \tfrac{5}{12}$.

This follows at once on the present theory, the terms representing two 
applications of the product rule to give the probability of drawing each 
of the two white balls. These are then added by the addition rule. 
But the proposition cannot be expressed as the disjunction of 5 alter- 
natives out of 12. My attention was called to this point by Miss J. 
Hosiasson. 
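
The two assessments for the boxes can be set side by side in exact arithmetic. The sketch below is only an illustration; the data structure and names are hypothetical. The first value counts white balls among all balls, as the first definition requires; the second applies the product rule within each box and the addition rule across boxes.

```python
from fractions import Fraction

# Two boxes: one white and one black; one white and two black.
boxes = [{"white": 1, "black": 1}, {"white": 1, "black": 2}]

# Definition 1: favourable alternatives over all alternatives (balls).
total_white = sum(box["white"] for box in boxes)
total_balls = sum(box["white"] + box["black"] for box in boxes)
by_counting = Fraction(total_white, total_balls)

# Product rule for each box, then the addition rule over the boxes.
chance_of_box = Fraction(1, len(boxes))
by_rules = sum(
    chance_of_box * Fraction(box["white"], box["white"] + box["black"])
    for box in boxes
)

print(by_counting, by_rules)
```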

On such a definition, again, what is the probability that the son of 
two dark-eyed parents will be dark-eyed? There are two possibilities, 
and the probability is $\tfrac12$. A geneticist would say that if both parents
had one blue-eyed parent the probability is $\tfrac34$; if at least one of them
is homozygous it is 1. But on the definition in question, until the last
possibility is definitely disproved, it remains possible that the child will
be blue-eyed and there is no alternative to the assessment $\tfrac12$. The
assessment $\tfrac34$ could be obtained by the zygote theory and the definition,
but then again, why should we make our definition in terms of
a hypothesis about the nature of inheritance instead of the observable 
difference ? If it is permitted to use such a hypothesis the assessment 
ceases to be unique, since it is now arbitrary what we are to regard as 
‘alternatives’ for the purpose of the definition. 

Similarly, the definition could attach no meaning to a statement that 
a die is biased. As long as no face is absolutely impossible, the probability
that any particular face will appear is $\tfrac16$ and there is no more
to be said. 

The definition appears to give the right answer to such a question 
as ‘What is the probability that my next hand at bridge will contain 
the ace of spades?’ It may go to any of four players and the result is
1/4. But is the result, in this form, of the slightest use? It says nothing

† Doctrine of Chances, 1738.



FREQUENCY DEFINITIONS AND DIRECT METHODS 


343 


§ 7.0 


more — in fact rather less — than that there are four possible alternatives, 
one of which will give me the ace of spades. If we consider the result 
of a particular deal as the unit ‘case’, there are $52!/(13!)^4$ possible deals,
of which $51!/\{12!\,(13!)^3\}$ will give me the ace of spades. The ratio is 1/4 as
before. It may appear that this gives me some help about the result
of a large number of deals, but does it? There are $\{52!/(13!)^4\}^n$ possible
sets of n deals. If $m_1$ and $m_2$ are two integers less than n, there are

$$\sum_{m=m_1}^{m_2} \frac{n!}{m!\,(n-m)!} \left\{\frac{51!}{12!\,(13!)^3}\right\}^{m} \left\{\frac{3\cdot 51!}{12!\,(13!)^3}\right\}^{n-m}$$

possible sets of deals that will give me the ace from $m_1$ to $m_2$ times.
Dividing this by the whole number of possible sets we get the binomial 
assessment. But on the definition the assessment means this ratio and 
nothing else. It does not say that I have any reason to suppose that 
I shall get the ace of spades between $\tfrac14 n \pm \tfrac12 (3n)^{1/2}$ times. This can be said
only if we introduce the notion of what is reasonable to expect, and 
say that on each occasion all deals are equally likely. If this is done
the result is what we want, but unfortunately the whole object of the 
definition is to avoid this notion. Without it, and using only pure 
mathematics and ‘objectivity’, which has not been defined, I may get 
the ace of spades anything from 0 to n times, and there is no more to
be said. Indeed, why should we not say that there are n + 1 possible
cases, of which those from $m_1$ to $m_2$ are $m_2 - m_1 + 1$, and the probability
that I shall get the ace of spades from $m_1$ to $m_2$ times is

$$(m_2 - m_1 + 1)/(n + 1)?$$

Either procedure would be legitimate in terms of the definition. The 
only reason for taking the former and not the latter is that we do con- 
sider all deals equally likely, and not all values of m. But unfortunately 
the users of the definition have rejected the notion of ‘equally likely’, 
and without it the result is ambiguous, and also useless in any case. 
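
The two ways of counting 'cases' can be set side by side numerically. The sketch below is illustrative; the function names and the particular values n = 20, m1 = 3, m2 = 7 are assumptions chosen only to show that the two procedures disagree.

```python
from fractions import Fraction
from math import comb, factorial

# Ways to deal 52 cards into four ordered hands of 13.
all_deals = factorial(52) // factorial(13) ** 4
# Deals in which one named player holds the ace of spades.
ace_deals = factorial(51) // (factorial(12) * factorial(13) ** 3)
p_ace = Fraction(ace_deals, all_deals)

def counting_deals(n, m1, m2):
    """Fraction of all sets of n deals that give the ace from m1 to m2 times."""
    no_ace = all_deals - ace_deals
    favourable = sum(
        comb(n, m) * ace_deals**m * no_ace ** (n - m) for m in range(m1, m2 + 1)
    )
    return Fraction(favourable, all_deals**n)

def counting_values_of_m(n, m1, m2):
    """Fraction of the n + 1 possible values of m that lie from m1 to m2."""
    return Fraction(m2 - m1 + 1, n + 1)

n, m1, m2 = 20, 3, 7   # hypothetical values for illustration
print(p_ace)           # 1/4
print(float(counting_deals(n, m1, m2)))
print(float(counting_values_of_m(n, m1, m2)))
```

Both functions are ratios of numbers of 'possible cases'; nothing in the definition itself selects the first over the second.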

For continuous distributions there are an infinite number of possible 
cases, and the definition makes the probability, on the face of it, the 
ratio of two infinite numbers and therefore meaningless. Neyman and 
Cramér try to avoid this by considering the probability as the ratio of
the measures of sets of points. But the measure of a continuous set is 
ambiguous until it is separately defined. If the members can be specified 
by associating them with the values of a continuous variable x, then 
they can be specified by those of any monotonic function f(x) of that
variable. The theory of continuity does not specify any particular 



344 FREQUENCY DEFINITIONS AND DIRECT METHODS Chap. VII 

measure, but merely that some measure exists and therefore that an 
infinite number of possible measures do. $x_2 - x_1$ and $f(x_2) - f(x_1)$ are
both possible measures of the interval between two points, and are not
in general in proportion. We cannot speak of the value of a probability 
on this definition until we have specified how the measure is to be 
taken. A pure mathematician, asked how to take it, would say: ‘It 
doesn’t matter; I propose to restrict myself to theorems that are true 
for all ways of taking it.’ But unfortunately the statistician does not
so restrict himself; he decides on one particular way, his theorems would 
be false for any other, and the reason for choosing that way is not 
explained. It is not even the obvious way. Where x is a continuous
variable it would seem natural to take the interval between any two 
points as the measure, and if its range is infinite the probability for 
any finite range would be zero. The assessment for the normal law of 
error is not taken as the interval but as the integral of the law over 
the interval, and this integral becomes a probability, in the sense stated, 
only by deriving the law in a very circuitous way from the dubious 
hypotheses used to explain it. The measure chosen is not the only one 
possible, and is not the physical measure. But in modern theories of 
integration the measure does appear to be the physical measure; at 
any rate pure mathematicians are willing to consider variables with an 
infinite range. 
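
The dependence of the assessment on the choice of measure can be seen by comparing two intervals under different monotonic labellings. This is a sketch only; the choice of the intervals (0, 1) and (1, 2), the cubic relabelling, and all function names are assumptions made for illustration.

```python
import math

def interval_ratio(measure, a, b, c, d):
    """Ratio of the measure of (a, b) to that of (c, d) under a given labelling."""
    return (measure(b) - measure(a)) / (measure(d) - measure(c))

def identity(x):
    return x                      # measure taken as the interval x2 - x1

def cube(x):
    return x ** 3                 # an arbitrary monotonic function f(x)

def normal_integral(x):
    """Integral of the normal law (unit standard error) from -infinity to x."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

for name, f in [("x", identity), ("x^3", cube), ("normal integral", normal_integral)]:
    print(name, interval_ratio(f, 0.0, 1.0, 1.0, 2.0))
```

The three ratios differ, so a statement that one interval is 'as probable' as another has no meaning until the measure has been specified.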

Even where the definition is unambiguous, as for the cases of dice- 
throwing and of the offspring of two heterozygous parents, its users 
would not accept its results. They would proceed by stating some limit 
of divergence from the most probable result and rejecting the hypo- 
thesis if the divergence comes beyond this limit. In these two cases 
they would, in fact, accept the experimental results. But this is a con-
tradiction. The definition is a mathematical convention involving no
hypothesis at all except that a certain number of cases are possible, 
and the experimental results show that these cases have occurred; the 
hypothesis is true. Therefore the original assessment of the probability 
stands without alteration, and to drop it for any other value is a con- 
tradiction. Therefore I say that this definition is never used even by 
its advocates; it is set up and forgotten before the ink is dry. The 
notion that they actually use is not defined; and as the results obtained 
are closely in agreement with those given by the notion of reasonable 
degree of belief the presumption, until more evidence is available, is 
that this notion is used unconsciously. 

Of all the theories advocated, it is the upholders of this one that
insist most on mathematical rigour, and they do, in fact, appear mostly 
to have a considerable command of modern mathematical technique.
But when the assessments have to be made by some principle not stated 
in the definitions, and are often flatly contradictory to the definitions, 
and when the application of the final result requires an interpretation 
different from that given by the definitions, the claim that the elaborate 
use of ε, $o(n^{-1/2})$, and ‘almost everywhere’ in the intermediate stages adds
anything to the rigour is on the same level as a claim that a building is 
strengthened by fastening a steel tie-beam into plaster at each end. 

7.03. With regard to the second and third definitions, we must 
remember our general criteria with regard to a theory. Does it actually 
reduce the number of postulates, and can it be applied in practice? 
Now these definitions plainly do not satisfy the second criterion. No 
probability has ever been assessed in practice, or ever will be, by
counting an infinite number of trials or finding the limit of a ratio in 
an infinite series. Unlike the first definition, which gave either an 
unacceptable assessment or numerous different assessments, these two 
give none at all. A definite value is got on them only by making a
hypothesis about what the result would be. The proof even of the
existence is impossible. On the limit definition, without some rule
restricting the possible orders of occurrence, there might be no limit 
at all. The existence of the limit is taken as a postulate by Mises, 
whereas Venn hardly considered it as needing a postulate.† Thus there
is no saving of hypotheses in any case, and the necessary existence of 
the limit denies the possibility of complete randomness, which would 
permit the ratio in an infinite series to tend to no limit. The postulate 
is an a priori statement about possible experiments and is in itself
objectionable. Using the infinite population, any finite probability is
the ratio of two infinite numbers and therefore is indeterminate. Thus
these definitions are useless for our purpose because they do not define;
the existence of the quantity defined has to be taken as a postulate, 
and then the definitions tell us nothing about its value or its properties, 
which must be the subject of further postulates. From the point of 

† Cf. R. Leslie Ellis, Camb. Phil. Trans. 8, 1840, 2. ‘For myself, after giving a painful
degree of attention to the point, I have been unable to sever the judgment that one event
is more likely to happen than another, or that it is to be expected in preference to it,
from the belief that in the long run it will occur more frequently.’ Consider a biased
coin, where we have no information about which way the bias is until we have experi-
mented. At the outset neither a head nor a tail is more likely than the other at the
first throw. Therefore, according to the statement, in a long series of throws heads
and tails will occur equally often. This is false whichever way the bias is.

‡ W. Burnside, Proc. Camb. Phil. Soc. 22, 1926, 729-7; Phil. Mag. I, 1926, 670-4.




view of reducing the number of postulates they give no advantage over
the use of chance as a primitive notion; their only purpose is to give 
a meaning to chance, but they never give its actual value because the 
experiments contemplated in them cannot be carried out, and the 
existence has no practical use without the actual value. In practice 
those who state them do obtain quantitative results, but these are never 
found in terms of the definition. They are found by stating possible 
values or distributions of chance, applying the product and addition 
rules, and comparing with observations. In fact the definitions appear 
only at the beginning and are never heard of again, the rest of the work 
being done in terms of rules derivable from the notion of reasonable 
degree of belief ; the rules cannot be proved from the definitions stated 
but require further postulates. 

The Venn limit and the infinite population do not involve the incon- 
sistency that is involved in the first definition when, for instance, bias 
of dice is asserted; since they do not specify a priori what the limit or
the ratio must be, they make it possible to alter the estimate of it
without contradiction. Venn,† considering the product rule, stated it
in terms of ‘cross-series’. If we consider an infinite series of propositions
all entailing r, P(p | r) and P(pq | r) would be defined by the limits of
ratios in this series, but P(q | pr) requires the notion of an infinite series
all implying p and r, and of a limiting ratio for the cases of q in this
series. If the series used is the actual one used in assessing P(p | r),
the product rule follows by algebra; but that does not prove that all
series satisfying p and r will give the same limiting ratio for q, or indeed
any limit. The existence of the limit and its uniqueness must be 
assumed separately in every instance. Mises takes them as postulates, 
and the question remains whether to take them as postulates is not 
equivalent to denying the possibility of randomness. With the defini- 
tion in terms of an infinite population the product rule cannot even be 
proved in the limited sense given by the Venn definition, and must 
be taken as a separate postulate. Thus both definitions require the 
existence of probabilities and the product rule to be taken as postulates, 
and save no hypotheses in comparison with the treatment based on the 
notion of degree of reasonable belief. The value of the quantity defined 
on them cannot be found from the definitions in any actual case. 
Degree of reasonable belief is at any rate accessible, and at the least 
it provides some justification of the product rule by pointing to a class 
of cases where it can be proved. 

† The Logic of Chance, 1866, pp. 162 et seq.




It is proved in 2.13 that, in specified conditions, the limit probably 
exists. But this proof is in terms of the notion of degree of reasonable 
belief and must be rejected by anybody that rejects that notion. He 
must deal with the fact that in terms of the definition of randomness 
the ratio may tend to any limit or no limit, and must deal with it in 
terms of pure mathematics. 

Fisher’s definition becomes workable if the infinite population is 
replaced by a large finite population. The addition and product rules 
could then be proved. The difficulty that the possible ratios would 
depend on the number in the population would be trivial if the popula- 
tion is large compared with the sample; the trouble about the infinite 
population is that it is precisely when it becomes infinite that the ratios 
become indefinite. Such a definition avoids the difficulty of the De 
Moivre definition about the different possible ways of stating the unit
alternatives. The numbers in the population would be defined as those 
that would be obtained, in the conditions of the experiment, in the 
given number of trials, and might well be unique. But there would 
still be some difficulties, since the actual set of observations would still 
have to be regarded as a random sample from the population, and the 
notion of ‘equally probable’ would enter through the notion of random- 
ness; it is also doubtful whether this notion could be applied validly 
to what must in any case be the first sample.

7.04. It appears to be claimed sometimes that the three definitions 
are equivalent. This is not so. For dice-throwing the first gives the 
chance of a 5 or a 6 unambiguously as 1/3; but the users of all three would
usually adopt the experimental result as an approximation, and it is 
appreciably larger — at any rate they would expect the limit in an 
indefinitely extended series to be more than 1/3. The first and second
definitions can be made equivalent only by assuming the existence of 
the limit and then treating the experimental result as irrelevant to its 
value. It is also sometimes stated that it is known by experiment that 
the Venn limit is identical with the ratio given by the first definition. 
This is simply false; and though this claim is sometimes made by good
mathematicians it appears that they must have temporarily forgotten 
the nature of a mathematical limit. The actual number of trials is 
always finite, and in the mathematical sense gives no information 
whatever about the result of an infinite series, unless the law connecting 
successive terms is given; and there is no such law for random selection.
It has been argued that for a finite population, sampled without replace- 
ment, the limit must be the ratio in the population. This is true, but
it gives no meaning to the statement that the ratio in m trials is likely to 
agree with that in the population to order $m^{-1/2}$. If the selection con-
sisted of picking out all members of one type before proceeding to the
other, the first statement would be true, but the second would be hope- 
lessly wrong, and it is the second that we need for any useful theory. 
For sampling with replacement, even with a finite population, there is 
no logical proof that we shall not go on picking the same member for 
ever. This is relevant to the argument concerning hands at cards. The 
usual assessment of the chance of getting the ace m times in n deals 
receives an attempted justification from the fact that we should get it 
in just this ratio if we got each possible deal once and once only. But 
unfortunately the conditions refer to sampling with replacement. Long 
before some deals had occurred some of the earlier ones would have 
occurred many times, and the argument cannot be applied. The 
difficulty will be appreciated by those who have tried to obtain a 
complete set of cards, one by one, from cigarette packets each contain- 
ing one. A dozen of one card may be obtained before some others have 
appeared at all. 
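
The cigarette-card difficulty can be illustrated by simulation. The sketch below assumes a set of 50 equally likely cards and a fixed random seed; both are hypothetical choices made for illustration, as are the function names.

```python
import random
from fractions import Fraction

def expected_packets(k):
    """Expected packets needed to see all k equally likely cards: k(1 + 1/2 + ... + 1/k)."""
    return k * sum(Fraction(1, i) for i in range(1, k + 1))

def simulate_collection(k, rng):
    """Sample with replacement until every card has appeared; return the count of each card."""
    counts = [0] * k
    missing = k
    while missing:
        card = rng.randrange(k)
        if counts[card] == 0:
            missing -= 1
        counts[card] += 1
    return counts

rng = random.Random(0)                 # fixed seed so the sketch is repeatable
counts = simulate_collection(50, rng)
print(float(expected_packets(50)))     # roughly 225 packets on average for 50 cards
print(sum(counts), max(counts))        # packets opened, and copies of the commonest card
```

With replacement, many repeats accumulate long before the set is complete, so the argument from 'each possible deal once and once only' does not apply.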

Some doubt is apparently felt by the advocates of these definitions, 
who are liable to say when challenged on a particular mathematical 
point that the statement is ‘reasonable’. But this gives away the entire 
case. The only excuse for the definitions is that they exclude the notion of 
‘reasonable’ in contrast to ‘mathematically proved’, and they therefore 
invite challenge on mathematical grounds. If an actual mathematical
proof cannot be given, showing that a different result is simply impossible, 
the result is not proved. To say then that it is reasonable is mathematically 
meaningless, and grants that ‘reasonable’ has a meaning, which is indis- 
pensable to the theory, and which is neither a mathematical nor an objective 
meaning. If it follows assignable rules they should be stated, which is 
what has been done here; if it does not, my Axiom 1 is rejected, and it 
is declared that it is reasonable to say, on the same data, both that p is 
more probable than q and q more probable than p. Curiously, however, 
the extreme tolerance expressed in such an attitude does not appear to 
be borne out in practice. The statistical journals are full of papers each 
maintaining, if not that the author’s method is the only reasonable 
one, that somebody else’s is not reasonable at all. 

7.05. The most serious drawback of these definitions, however, is 
the deliberate omission to give any meaning to the probability of a 
hypothesis. All that they can do is to set up a hypothesis and give 
arbitrary rules for rejecting it in certain circumstances. They do not
say what hypothesis should replace it in the event of rejection, and there 
is no proof that the rules are the best in any sense. The scientific law is 
thus (apparently) made useless for purposes of inference. It is merely 
something set up like a coconut to stand until it is hit; an inference 
from it means nothing, because these treatments do not assert that 
there is any reason to suppose the law to be true, and it thus becomes 
indistinguishable from a guess. Nevertheless in practice much con- 
fidence is placed in these inferences, if not by statisticians themselves, 
at least by the practical men that consult them for advice. I maintain 
that the practical man is right; it is the statistician’s agnosticism that 
is wrong. The statistician’s attitude is, of course, opposite to that of the
applied mathematician, who asserts that his laws are definitely proved. 
But an intermediate attitude that recognizes the validity of the notion 
of the probability of a law avoids both difficulties. 

The actual procedure is usually independent of the definitions. A 
distribution of chance is set up as a hypothesis, and more complicated 
probabilities are derived from it by means of the addition and product 
rules. I have no criticism of this part of the work, since the distribution 
is always at the very least a suggestion worth investigation, and the 
two rules appear also in my theory. But the answer is necessarily in the
form of a distribution of the chance of different sets of observations, 
given the same hypothesis. The practical problem is the inverse one; 
we have a unique set of observations and the problem is to decide 
between different hypotheses by means of it. The transition from one to 
the other necessarily involves some new principle. Even in pure mathe-
matics we have this sort of ambiguity. If x = 1, it follows that
$x^2 + x - 2 = 0$. But if $x^2 + x - 2 = 0$, it does not follow that x = 1. It
would if we had the supplementary information that x is positive. In
the probability problem the difficulty is greater, because in any use 
of a given set of observations to choose between different laws, or differ-
ent values of parameters in the same law, we are making a selection out
of a range, usually continuous, of possible values of the parameters, 
between which there is originally usually little to choose. (On the Venn 
and Fisher definitions this would mean a decision of which series or 
which population is to be chosen out of a super-population.) The actual 
selection must involve some principle that is not included in the direct 
treatment. The principle of inverse probability carries the transition 
out formally, the prior probability being chosen to express the previous 
information or lack of it. Rejecting the restriction of probabilities to
those of observations given hypotheses and applying the rules to the
probabilities of hypotheses themselves, the principle of inverse proba- 
bility is a theorem, being an immediate consequence of the product 
rule. No new hypothesis is needed. But the restriction spoken of makes 
some new hypothesis necessary, and we must examine what this is. 
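
The formal transition from direct to inverse probabilities can be illustrated on a two-box problem like the one considered earlier in this chapter. The sketch is an assumption-laden illustration: the labels 'box A' and 'box B', the equal prior probabilities, and the function name are introduced here and are not in the text.

```python
from fractions import Fraction

def posterior(prior, likelihood):
    """P(h | data) from P(h) and P(data | h), using the product rule in both orders."""
    joint = {h: prior[h] * likelihood[h] for h in prior}   # P(h, data)
    total = sum(joint.values())                            # P(data), by the addition rule
    return {h: joint[h] / total for h in joint}

prior = {"box A": Fraction(1, 2), "box B": Fraction(1, 2)}
p_white = {"box A": Fraction(1, 2), "box B": Fraction(1, 3)}   # direct probabilities

result = posterior(prior, p_white)
print(result["box A"], result["box B"])   # 3/5 2/5
```

The direct probabilities of a white ball given each box are turned into probabilities of each box given a white ball without any principle beyond the product and addition rules, once probabilities of hypotheses are admitted.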

7.1. ‘Student’s’ treatment of the problem of the uncertainty of the 
mean of a set of observations derived from the normal law provides 
an interesting illustration, and has the further merit of being accepted 
by all schools. The result actually proved is 2.8 (18)

$$P(dz \mid x, \sigma, H) \propto (1+z^2)^{-\frac{1}{2}n}\,dz, \qquad (1)$$

where x and σ are the true value and standard error, supposed known,
and if x̄ and s are the mean and standard deviation of the observations,

$$z = (x - \bar{x})/s. \qquad (2)$$

My result is, 3.41 (6),

$$P(dz \mid \theta H) \propto (1+z^2)^{-\frac{1}{2}n}\,dz, \qquad (3)$$

which, since the right side involves the observations only through x̄ and
s, leads, by the principle of the suppression of irrelevant data (1.7), to

$$P(dz \mid \bar{x}, s, H) \propto (1+z^2)^{-\frac{1}{2}n}\,dz. \qquad (4)$$

This is not the same thing as (1) since the data are different. The usual 
way of stating (1) speaks of the probability of a proposition by itself 
without explicit mention of the data, and we have seen how confusing 
assessments on different data may lead to grossly wrong results even 
in very simple direct problems. In a case analogous to this we may note 
that the probability that Mr. Smith is dead to-day, given that he had 
smallpox last week, is not the same as the probability that he had small-
pox last week, given that he is dead to-day. But here if we interpret (1)
to mean (4) we get the correct posterior probability distribution for x
given x̄ and s, and this is what in fact is done. But (1) certainly does
not mean (4), and we must examine in what conditions it can imply it.
We notice first that the inclusion of any information about x̄ and s in
the data in (1), other than the information already given in the state-
ment of x, σ, and H (the latter involving the truth of the normal law),
would make it false. For the assessment on information including the
exact value of either x̄ or s would no longer depend on z alone, but
would involve the value of x − x̄ or of s/σ explicitly. For intermediate
amounts of information other parameters would appear, and would
appear in the answer. Thus we cannot proceed by including x̄ and s in
the data in (1) and then suppressing x and σ as irrelevant to get (4);
for if we did this the probability of dz would be unity for all ranges
that included the actual value and zero for all others. 

But we notice that in (1) the values of x and σ are irrelevant to z, and
can therefore be suppressed, by Theorem 11, to give

$$P(dz \mid H) \propto (1+z^2)^{-\frac{1}{2}n}\,dz, \qquad (5)$$

since the conditions of observation H entail the existence of x̄ and s,
x and σ, and this is the vital step. On the face of it this says nothing,
for z has no value unless the quantities x, x̄, and s are given. But just
for that reason it is now possible that if we now introduce x̄ and s into
the data the form will be unaltered. The argument is apparently that
the location of the probability distribution of x, given x̄ and s, must
depend only on x̄, and its scale must depend only on s. But this amounts
to saying that

$$P(dz \mid \bar{x}, s, H) = f(z)\,dz; \qquad (6)$$

and since x̄ and s are irrelevant to z they can be suppressed, and the
left side reduces to P(dz | H), which is known from (5). Thus the
result (4) follows. 

Something equivalent to the above seems to have been appreciated 
by ‘Student’, though it cannot be expressed in his notation. But we 
must notice that it involves two hypotheses: first, that nothing in the 
observations but x̄ and s is relevant; secondly, that whatever they may
be in the actual observations we are at full liberty to displace or rescale 
the distribution in accordance with them. The first is perhaps natural, 
but it is desirable to keep the number of hypotheses as small as possible, 
whether they are natural or not, and the result is proved by the principle 
of inverse probability. The second can mean only one thing, that the 
true value x and the standard error σ are initially completely unknown.
If we had any information about them we should not be permitted to 
adjust the distribution indefinitely in accordance with the results of one 
set of observations, and (6) would not hold. ‘Student’ indeed noticed 
this, for his original tables† are entitled ‘Tables for estimating the
probability that the mean of a unique sample of observations lies
between −∞ and any given distance of the mean of the population
from which the sample is drawn’. There is no particular virtue in the
word ‘unique’ if the probability is on data x, σ, H; the rule (1) would
apply to every sample separately. But when the problem is to proceed
from the sample to x uniqueness is important. If H contained informa-
tion from a previous sample, this would not affect (1), since, given x
and σ, any further information about them would tell us nothing new.

† Biometrika, 11, 1917, 414.




But it would affect the transition from (1) to (6), and this would be 
recognized in practice by combining the samples and basing the 
estimate on the two together. ‘Student’ called my attention to the 
vital word just after the publication of a paper of mine on the subject,†
showing that he had in fact clearly noticed the necessity of the condition
that the sample considered must constitute our only information about
x and σ. The conditions contemplated by him are in fact completely
identical with mine, and he recognized the essential point, that the 
usefulness of the result depends on the particular state of previous 
knowledge, namely, absence of knowledge. 

It can be shown further that if we take (4) as giving the correct 
posterior probability of x, there is only one distribution of the prior
probability that can lead to it, namely

$$P(dx\,d\sigma \mid H) \propto dx\,d\sigma/\sigma. \qquad (7)$$

For the result implies that the most probable value of x is the mean,
and that for two observations there is a probability 1/2 that x lies
between them. But the former implies a uniform prior probability
distribution for x, and the latter, by 3.7, implies the dσ/σ rule.‡ Given
this my argument in 3.4 follows. The irrelevance of information in the
sample other than x̄ and s holds for all assessments of the prior prob-
ability. Hence the hypotheses made by ‘Student’ are completely equi-
valent to mine; they have merely been introduced in a different order.
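
The passage from the prior probability (7) to the posterior form (4) can be sketched as follows. The sketch assumes n observations $x_1, \ldots, x_n$ from the normal law, writes $\theta$ for the observations, and takes $s^2 = \sum (x_i - \bar{x})^2/n$ and $z = (x - \bar{x})/s$; these conventions are assumptions stated for the purpose of the illustration.

```latex
\begin{align*}
P(dx\,d\sigma \mid \theta H)
  &\propto \frac{dx\,d\sigma}{\sigma}\,\sigma^{-n}
     \exp\left\{-\frac{n}{2\sigma^{2}}\left[s^{2}+(x-\bar{x})^{2}\right]\right\},\\
P(dx \mid \theta H)
  &\propto dx\int_{0}^{\infty}\sigma^{-n-1}
     \exp\left\{-\frac{n}{2\sigma^{2}}\left[s^{2}+(x-\bar{x})^{2}\right]\right\}d\sigma
   \;\propto\; \left[s^{2}+(x-\bar{x})^{2}\right]^{-\frac{1}{2}n}dx,\\
P(dz \mid \theta H)
  &\propto (1+z^{2})^{-\frac{1}{2}n}\,dz .
\end{align*}
```

The second line uses the substitution $u = n[s^2 + (x-\bar{x})^2]/(2\sigma^2)$, which reduces the integral to a gamma function multiplied by the stated power; the third uses $dx = s\,dz$.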

Similar considerations affect Fisher’s fiducial argument. Speaking of
‘Student’s’ rule, he says:§ ‘It must now be noticed that t is a continuous
function of the unknown parameter, the mean, together with observable
values, x̄, s, and n, only. Consequently the inequality

$$t > t_1$$

is equivalent to the inequality

$$\mu < \bar{x} - s t_1/\sqrt{n}$$

so that this last must be satisfied with the same probability as the first.

... We may state the probability that μ is less than any assigned value,
or the probability that it lies between any assigned values, or, in short,
its probability distribution, in the light of the sample observed.’ The
innocent-looking mathematical transformation, however, covers the
passage from data x and σ to data x̄ and s (Fisher’s μ being my x)
which the notation used is not adequate to express. The original assess-
ment was on data including μ, and if these were still being used the

† Proc. Roy. Soc. A, 160, 1937, 326-48.

‡ A proof adapted to the normal law of error is given in my paper just mentioned.
§ Ann. Eugen. 6, 1936, 392.





probability that μ is in a particular range is 1 if the range includes the
known value and 0 if it does not. The argument therefore needs the same
elaboration as was applied above to that of ‘Student’. It may be
noticed that in speaking of the probability distribution of μ in the light
of the sample Fisher has apparently abandoned the restriction of the
meaning of probability to direct probabilities; different values of μ are
different hypotheses and he is speaking of their probabilities on the data, 
apparently, in precisely the same sense as I should. He does criticize the 
use of the prior probability in the same paper, but he appears to under- 
stand by it something quite different from what I do. My only criticism 
of both his argument and ‘Student’s’ is that they omit important 
steps, which need considerable elaboration, and that when these are 
given the arguments are much longer than those got by introducing the 
prior probability to express previous ignorance at the start. 

Fisher heads a section in his book† ‘The significance of the mean
of a unique sample’ and proceeds: ‘If $x_1, x_2, \ldots, x_{n'}$ is a sample of n'
values of a variate x, and if this sample constitutes the whole of the
information on the point in question, then we may test whether the
mean of x differs significantly from zero by calculating the statistics ...’

Here we have the essential point made perfectly explicit. The test is 
not independent of previous knowledge, as Fisher is liable to say in 
other places; it is to be used only where there is no relevant previous 
knowledge. ‘No previous knowledge’ and ‘any conditions of previous 
knowledge’ differ as much as ‘no money’ and ‘any amount of money’ do. 

7.11. A different way of justifying the practical use of the rule with-
out speaking of the probability of different values of x is as follows. Since
P(dz | x, σ, H) is independent of x and σ, and of all previous observa-
tions, it is a chance. If we take an enormous number of samples of num-
ber n, the fraction with z between two assigned values will approximate
to the integral of the law between them, by Bernoulli’s theorem.
This will be true whether x and σ are always the same or vary from one
sample to another. Then we can apparently say that actual values of z
will be distributed in proportion to the integrals of $(1+z^2)^{-\frac{1}{2}n}$, and regard
actual samples as a selection from this population; then the proba-
bilities of errors greater than $\pm zs$ will be assigned in the correct ratio
by the rule that the most probable sample is a fair sample. The trouble 
about the argument, however, is that it would hold equally well if x 
and σ were the same every time. If we proceed to say that x lies between
x̄ ± 0.75s in every sample of ten observations that we make, we shall be

† Statistical Methods, 1936, p. 125.


wrong in about 5 per cent. of the cases, irrespective of whether x is the
same every time or not, or of whether we know it or not. It is suggested
that we should habitually reject a suggested value of x by some such
rule as this, but applying this in practice would imply that if x was
known to be always the same we must accept it in 95 per cent. and
reject it in 5 per cent. of the cases, which hardly seems a satisfactory
state of affairs. There is no positive virtue in rejecting a hypothesis in
5 per cent. of the cases where it is true, though it may be inevitable,
if we are to have any rule at all for rejecting it when it is false, that we 
shall sometimes reject it when it is true. In practice nobody would use the 
rule in this way if x was always the same; samples would always be com- 
bined. Thus, whatever may be recommended in theory, the statistician 
does allow for previous knowledge by the rather drastic means of restrict- 
ing the range of hypotheses that he is willing to consider at all. The rule 
recommended would be used only when there is no previous information 
relevant to x and a. Incidentally Bernoulli’s theorem, interpreted to give 
an inference about what will happen in a large number of trials, cannot 
be proved from a frequency definition, and the passage to an inference in 
a single case, which is the usual practical problem, still needs the notion 
of degree of reasonable belief, which therefore has to be used twice. 
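
The claim that the rule errs in about 5 per cent. of samples whether or not x and σ change from sample to sample can be checked by simulation. The following is only a sketch under stated assumptions: the rule is read as x̄ ± 0.75s for samples of ten, s is taken as the root mean square residual with divisor n, and the ranges from which x and σ are redrawn are arbitrary choices made for illustration.

```python
import math
import random

def miss_rate(n_samples, n=10, half_width=0.75, vary=False, seed=1):
    """Fraction of samples for which x lies outside x_bar +/- half_width * s."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(n_samples):
        # True value x and standard error sigma: fixed, or redrawn per sample
        # (the ranges below are arbitrary illustrative assumptions).
        x = rng.uniform(-100.0, 100.0) if vary else 0.0
        sigma = rng.uniform(0.1, 10.0) if vary else 1.0
        obs = [rng.gauss(x, sigma) for _ in range(n)]
        x_bar = sum(obs) / n
        # s: root mean square residual with divisor n (assumed convention).
        s = math.sqrt(sum((o - x_bar) ** 2 for o in obs) / n)
        if abs(x_bar - x) > half_width * s:
            misses += 1
    return misses / n_samples

fixed = miss_rate(20000, vary=False)
varying = miss_rate(20000, vary=True)
print(f"x, sigma fixed:   {fixed:.3f}")
print(f"x, sigma varying: {varying:.3f}")
```

Both fractions should come out close to one in twenty, which is the point of the argument: the frequency statement holds in either case and so cannot by itself tell us what to infer in a single sample.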

Some hypothesis is needed in any case to enable us to proceed from a
comparison of different sets of data on the same hypothesis to a
comparison of different hypotheses on the same data; no discredit is
therefore to be attached to ‘Student’ for making one. It cannot, however,
be claimed legitimately that the argument is independent of previous
knowledge. It would be valid only in the special case where there is no
previous knowledge about x and σ, and would not be used in practice
in any other. The hypothesis that, given H but no information about
x and σ other than that provided by x̄ and s, x̄ and s are irrelevant to z
is essential to the argument. It may be accepted as reasonable, but it
is none the less a hypothesis.

7.2. An enigmatic position in the history of the theory of probability 
is occupied by Karl Pearson. His best-appreciated contributions in 
principle are perhaps the invention of χ², the introduction of the product
moment formula to estimate the correlation coefficient, and the Pearson
types of error law; besides of course an enormous number of applications
to special subjects. I should add to these the Grammar of Science,
which remains the outstanding general work on scientific method, and 
the recognition in it that the Bayes-Laplace uniform assessment of the
prior probability is not final, but can be revised to take account of
previous information about the values that have occurred in the past 
in analogous problems. The anomalous feature of his work is that 
though he always maintained the principle of inverse probability, and 
made this important advance, he seldom used it in actual applications, 
and usually presented his results in a form that appears to identify 
a probability with a frequency. In particular his numerous tables of 
chances are mostly entitled frequencies. In determining the parameters 
of laws of his own types from observations he did not use inverse proba- 
bility, and when Fisher introduced maximum likelihood, which is 
practically indistinguishable from inverse probability in estimation 
problems, Pearson continued to use the method of moments. A possible 
reason for this that many would appreciate is that complete tables 
for fitting by moments were already available, and that the fitting of 
a law with four adjustable parameters by maximum likelihood is not a 
matter to be undertaken lightly when sufficient statistics do not exist. 
But Pearson in his very last paper maintained that the method of 
moments was not merely easier than maximum likelihood, but actually 
gave a better result. He also never seems to have seen the full
importance of χ² itself. When the data are observed numbers, he showed
that the probability of the numbers, given a law, is proportional to
exp(−½χ²) with a third-order error. Thus the equivalence of maximum
likelihood and minimum χ² was Pearson’s result, and the close
equivalence of maximum likelihood and inverse probability in estimation
problems is so easy to show that it is remarkable that Pearson
overlooked it. Most of the labour of computing the likelihood is avoided
if χ² is used instead, though there are complications when some of the
expectations are very small; but even these are avoided by the
treatment of 4.2. Fisher repeatedly drew attention to the relation between
maximum likelihood and minimum χ², but Pearson never accepted the
consequence that if he used the latter he would have had a convenient
method, more accurate than the method of moments, and justified by
principles that he himself had stated repeatedly.
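
The statement that the probability of observed numbers, given a law, is nearly proportional to exp(−½χ²) can be illustrated numerically. This is a minimal sketch, not Pearson's derivation: the three chances and both sets of counts are hypothetical, and the comparison is between the exact multinomial log-probability (relative to counts equal to their expectations) and −½χ².

```python
import math

def log_multinomial(counts, probs):
    """Exact log-probability of the counts (in any order) given the chances."""
    total = sum(counts)
    out = math.lgamma(total + 1)
    for n_r, p_r in zip(counts, probs):
        out += n_r * math.log(p_r) - math.lgamma(n_r + 1)
    return out

def chi_squared(counts, probs):
    """Sum of (observed - expected)^2 / expected over the groups."""
    total = sum(counts)
    return sum((n_r - total * p_r) ** 2 / (total * p_r)
               for n_r, p_r in zip(counts, probs))

probs = [0.5, 0.3, 0.2]        # hypothetical law
expected = [500, 300, 200]     # counts equal to expectation, chi^2 = 0
observed = [520, 290, 190]     # hypothetical observed counts

exact = log_multinomial(observed, probs) - log_multinomial(expected, probs)
approx = -0.5 * chi_squared(observed, probs)
print(f"exact log ratio : {exact:.4f}")
print(f"-chi^2 / 2      : {approx:.4f}")
```

The two numbers agree to within a few hundredths for counts of this size, so maximizing the likelihood and minimizing χ² lead to nearly the same estimates.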

In practice Pearson used χ² only as a significance test. His method,
if there were n groups of observations, was to compute the complete χ²
for the data, in comparison with the law being tested. If m parameters
had been found from the data, he would form the integral

$$P(\chi^2) = \int_{\chi}^{\infty} \chi^{n-m-1} e^{-\frac{1}{2}\chi^2}\,d\chi \Big/ \int_{0}^{\infty} \chi^{n-m-1} e^{-\frac{1}{2}\chi^2}\,d\chi,$$

which is the probability, given a law, that the χ² formed from n−m
random variations in comparison with their standard errors would
exceed the observed value. (In his earlier use of χ² he allowed only
for one adjustable parameter, the whole number of observations; the
need to allow for all was pointed out by Fisher† and emphasized by
Yule.‡) If P was less than some standard value, say 0.05 or 0.01, the
law considered was rejected. Now it is with regard to this use of P 
that I differ from all the present statistical schools, and detailed atten- 
tion to what it means is needed. The fundamental idea, and one that 
I should naturally accept, is that a law should not be accepted on data 
that themselves show large departures from its predictions. But this 
requires a quantitative criterion of what is to be considered a large 
departure. The probability of getting the whole of an actual set of 
observations, given the law, is ridiculously small. Thus for frequencies
2.74(6) shows that the probability of getting the observed numbers, in
any order, decreases with the number of observations like
$(2\pi N)^{-\frac{1}{2}(p-1)}$ for χ² = 0 and like $(2\pi N e)^{-\frac{1}{2}(p-1)}$
for χ² = p − 1, the latter being near the expected value of χ². The
probability of getting them in their actual order requires division by
N!. If mere improbability of the
observations, given the hypothesis, was the criterion, any hypothesis 
whatever would be rejected. Everybody rejects the conclusion, but this 
can mean only that improbability of the observations, given the hypo- 
thesis, is not the criterion, and some other must be provided. The 
principle of inverse probability does this at once, because it contains an 
adjustable factor common to all hypotheses, and the small factors in 
the likelihood simply combine with this and cancel when hypotheses 
are compared. But without it some other criterion is still necessary, 
or any alternative hypothesis would be immediately rejected also. 
Now the P integral does provide one. The constant small factor is 
rejected, for no apparent reason when inverse probability is not used, 
and the probability of the observations is replaced by that of χ² alone,
one particular function of them. Then the probability of getting the
same or a larger value of χ² by accident, given the hypothesis, is
computed by integration to give P. If χ² is equal to its expectation
supposing the hypothesis true, P is about 0.5. If χ² exceeds its expectation
substantially, we can say that the value would have been unlikely to 
occur had the law been true, and shall naturally suspect that the law 
is false. So much is clear enough. If P is small, that means that there 
have been unexpectedly large departures from prediction. But why
should these be stated in terms of P? The latter gives the probability
of departures, measured in a particular way, equal to or greater than
the observed set, and the contribution from the actual value is nearly
always negligible. What the use of P implies, therefore, is that a
hypothesis that may be true may be rejected because it has not predicted
observable results that have not occurred. This seems a remarkable
procedure. On the face of it the fact that such results have not occurred
might more reasonably be taken as evidence for the law, not against it.
The same applies to all the current significance tests based on
P integrals.§

† J. R. Stat. Soc. 85, 1922, 87-94. ‡ Ibid., pp. 95-106.
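
Pearson's P, as described above, is the upper tail area of χ² on n − m degrees of freedom. A minimal sketch of its evaluation follows, using the standard finite series for integer degrees of freedom; the function name and the argument `dof` (standing for n − m) are illustrative choices, not notation from the text.

```python
import math

def chi2_tail(chi2, dof):
    """P = chance that chi-squared on `dof` degrees of freedom exceeds `chi2`."""
    if chi2 <= 0.0:
        return 1.0
    half = 0.5 * chi2
    if dof % 2 == 0:
        # Even dof: exp(-chi2/2) * sum over j < dof/2 of (chi2/2)^j / j!
        term = math.exp(-half)
        total = term
        for j in range(1, dof // 2):
            term *= half / j
            total += term
        return total
    # Odd dof: two-sided normal tail plus a finite series in odd powers of chi.
    chi = math.sqrt(chi2)
    total = math.erfc(chi / math.sqrt(2.0))
    term = 2.0 * chi * math.exp(-half) / math.sqrt(2.0 * math.pi)
    for r in range(1, (dof - 1) // 2 + 1):
        total += term
        term *= chi2 / (2 * r + 1)
    return total

for dof in (1, 2, 5, 10):
    # P when chi-squared equals its expectation, which is dof.
    print(dof, round(chi2_tail(float(dof), dof), 3))
```

Every value of χ² beyond the observed one contributes to this area, which is exactly the feature of P that the paragraph above objects to.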

The use of the integral goes back to Chauvenet’s criterion for rejecting
observations. This proceeded as follows. Let P(m) be the chance
on the normal law of an error greater than mσ. Then the chance that
all of n errors will be less than mσ is $\{1 - P(m)\}^n$, and the chance that
there will be at least one greater than mσ is $1 - \{1 - P(m)\}^n$. The first
estimate of the true value and standard error were used to find the
chance that there would be at least one residual larger than the largest
actually found. If this was greater than ½ the observation was rejected,
and a mean and a standard error were found from the rest and the
process repeated until none were rejected. Thus on this method there
would be an even chance of rejecting the extreme observation even if
the normal law was true. If such a rule was used now the limit would
probably be drawn at a larger value, but the principle remains, that an
observation that might be normal is rejected because other observations
not predicted by the law have not occurred. Something might be
said for rejecting the extreme observation if the law gave a small chance
of a residual exceeding the second largest; then indeed something not
predicted by the law might be said to have occurred, but to apply such
a rule to the largest observation is wrong in principle. (Even if the
normal law does not hold, rejection of observations and treating the rest
as derived from the normal law is not the best method, and may give
a spurious accuracy; but the question here concerns the decision as to
whether the normal law applies to all the n observations.)
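
The remark that such a rule gives an even chance of rejecting the extreme observation when the normal law is true can be checked by simulation. This is a sketch under simplifying assumptions, not Chauvenet's full procedure: the true value and σ are taken as known (so no re-estimation or repetition is modelled), and P(m) is read as the two-sided chance of an error exceeding mσ in magnitude.

```python
import math
import random

def chance_of_larger(largest_m, n):
    """Chance that at least one of n normal errors exceeds largest_m * sigma."""
    p_m = math.erfc(largest_m / math.sqrt(2.0))   # two-sided P(m), an assumption
    return 1.0 - (1.0 - p_m) ** n

def fraction_below_half(n, trials, seed=7):
    """Fraction of normal samples whose largest error gives a chance under 1/2."""
    rng = random.Random(seed)
    below = 0
    for _ in range(trials):
        # Errors drawn from the normal law with sigma = 1, taken as known.
        largest = max(abs(rng.gauss(0.0, 1.0)) for _ in range(n))
        if chance_of_larger(largest, n) < 0.5:
            below += 1
    return below / trials

print(fraction_below_half(n=20, trials=20000))
```

The computed chance falls on either side of ½ in about half the samples, so whichever side is treated as the rejection region, a sample that obeys the normal law loses its extreme observation about half the time.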

§ On the other hand, Yates (J. R. Stat. Soc., Suppl. 1, 1934, 217-36)
recommends, in testing whether a small frequency n_r is consistent with
expectation, that χ² should be calculated as if this frequency was n_r + ½
instead of n_r, and thereby makes the actual value contribute largely to P.
This is also recommended by Fisher (Statistical Methods, p. 98). It only
remains for them to agree that nothing but the actual value is relevant.

It must be said that the method fulfils a practical need; but there
was no need for the paradoxical use of P. The need arose from the fact
that in estimating new parameters the current methods of estimation
ordinarily gave results different from zero, but it was habitually found
that those up to about twice the standard error tended to diminish
when the observations became more numerous or accurate, which was 
what would be expected if the differences represented only random 
error, but not what would be expected if they were estimates of a 
relevant new parameter. But this could be dealt with in a rough 
empirical way by taking twice the standard error as a criterion for 
possible genuineness and three times the standard error for definite 
acceptance. This would rest on a valid inductive inference from analo- 
gous cases, though not necessarily the best one. Now this would mean 
that the former limit would be drawn where the joint probability of
the observations is $e^{-2}$ of the value for the most probable result,
supposing no difference present, and the latter at $e^{-4.5}$. This would
depend on the probability of the actual observations and thus on the
ordinate of the direct probability distribution, not on the integral. The
ordinate does depend on the hypothesis and the observed value, and nothing
else. Further, since nearly all the more accurate tests introduced since
have depended on the use of distributions that are nearly normal in
the range that matters, there would be a natural extension in each case,
namely to draw the two lines where the ordinates are $e^{-2}$ and $e^{-4.5}$
times those at the maximum. The practical difference would not be great,
because in the normal distribution, for instance, for x large and positive,

$$\frac{1}{\sigma\sqrt{2\pi}} \int_x^{\infty} \exp\left(-\frac{u^2}{2\sigma^2}\right) du \sim \frac{\sigma}{x\sqrt{2\pi}} \exp\left(-\frac{x^2}{2\sigma^2}\right),$$

and the exponential factor varies much more rapidly than x. The use
of a standard value for the ordinate rather than P would give practically 
the same decisions in all such cases. Its choice, however, would rest 
on inductive evidence, which could be stated; there would be no need 
for the apparently arbitrary choice of fixed limits for P, or for the 
paradox in the use of P at all. 
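
The relation between ordinate and tail area in the normal law can be tabulated in a few lines. This is an illustrative sketch with σ taken as 1; the function names are not from the text.

```python
import math

def ordinate_ratio(t):
    """Normal ordinate at t standard errors relative to the ordinate at 0."""
    return math.exp(-0.5 * t * t)

def upper_tail(t):
    """Exact upper tail area of the standard normal law beyond t."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def upper_tail_asymptotic(t):
    """Large-t approximation to the tail: the ordinate divided by t."""
    return math.exp(-0.5 * t * t) / (t * math.sqrt(2.0 * math.pi))

for t in (2.0, 3.0, 5.0):
    print(t, ordinate_ratio(t), upper_tail(t), upper_tail_asymptotic(t))
```

At two and three standard errors the ordinate ratios are exp(−2) and exp(−4.5), and for larger deviations the tail is governed almost entirely by the same exponential factor, which is why a fixed ordinate and a fixed P lead to nearly the same decisions.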

Some feeling for the ordinate seems to lie behind the remarks (see p.
281) of Yule and Kendall and Fisher on the subject of suspiciously small
χ² and P very near 1. It is hard to understand these if P is taken as the
sole criterion, but they become comprehensible at once if the ordinate is
taken as the criterion; P very near 1 does correspond to a small ordinate.

7.21. It should be said that several of the P integrals have a definite
place in the present theory, in problems of pure estimation. For the
normal law with a known standard error, or for those sampling problems
that reduce to it, the total area of the tail represents the probability,
given the data, that the estimated difference has the right sign,
provided that there is no question whether the difference is zero. (If some
previous suggestion of a specific value of a parameter is to be considered 
at all, it must be disposed of by a significance test before any question 
of estimating any other value arises. Then, strictly speaking, if the 
adjustable parameter is supported by the data the test gives its posterior
probability as a by-product.) Similarly, the t rule gives the complete 
posterior probability distribution of a quantity to be estimated from 
the data, provided again that there is no doubt initially about its 
relevance; and the integral gives the probability that it is more or less 
than some assigned value. The z rule also gives the probability distribu- 
tion of the scatter of a new set of observations or of means of observa- 
tions, given an existing set. These are all problems of pure estimation. 
But their use as significance tests covers a looseness of statement of
what question is being asked. They give the correct answer if the
question is: If there is nothing to require consideration of some special
values of the parameter, what is the probability distribution of that
parameter given the observations? But the question that concerns us
in significance tests is: If some special value has to be excluded before
we can assert any other value, what is the best rule, on the data
available, for deciding whether to retain it or adopt a new one? The former
is what I call a problem of estimation, the latter of significance. Some
feeling of discomfort seems to attach itself to the assertion of the 
special value as right, since it may be slightly wrong but not sufficiently 
to be revealed by a test on the data available; but no significance test 
asserts it as certainly right. We are aiming at the best way of progress, 
not at the unattainable ideal of immediate certainty. What happens 
if the null hypothesis is retained after a significance test is that the 
maximum likelihood solution or a solution given by some other method 
of estimation is rejected. The question is, When we do this, do we
expect thereby to get more or less correct inferences than if we followed 
the rule of keeping the estimation solution regardless of any question 
of significance? I maintain that the only possible answer is that we 
expect to get more. The difference as estimated is interpreted as random 
error and irrelevant to future observations. In the last resort, if this 
interpretation is rejected, there is no escape from the admission that 
a new parameter may be needed for every observation, and then all 
combination of observations is meaningless, and the only valid presenta- 
tion of data is a mere catalogue without any summaries at all. 

If any concession is to be made to the opinion that a new parameter
rejected by a significance test is probably not zero, it can be only that
it is considerably less than the standard error given by the test; but 
there is no way of stating this sufficiently precisely to be of any use. 

The use of the P integral in significance tests, however, merely 
expresses a feeling that some standard is required. In itself it is falla- 
cious because it rejects a hypothesis on account of observations that 
have not occurred; its only justification is that it gives some sort of 
a standard which works reasonably well in practice, but there is not 
the slightest reason to suppose that it gives the best standard. Fisher 
writes,† speaking of the normal law: ‘The value for which P = 0.05, or
1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit
in judging whether a deviation is to be considered significant or not.
Deviations exceeding twice the standard error are thus formally
regarded as significant. Using this criterion we should be led to follow
up a false indication only once in 22 trials, even if the statistics were
the only guide available. Small effects will still escape notice if the data
are insufficiently numerous to bring them out, but no lowering of the
standard of significance would meet this difficulty.’ Convenient is Fisher’s
word; there is no claim that the criterion is the best. But the idea that 
the best limit can be drawn at some unique value of P has somehow 
crept into the literature, without apparently the slightest attempt at a 
justification or any ground for saying what the best value is. 
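
The two figures in Fisher's remark, 1 in 20 at 1.96 standard errors and once in 22 trials at twice the standard error, follow from the two-sided tail of the normal law. A short check, with an illustrative function name:

```python
import math

def two_sided_p(deviation_in_standard_errors):
    """Chance, on the normal law, of a deviation at least this large either way."""
    return math.erfc(abs(deviation_in_standard_errors) / math.sqrt(2.0))

for t in (1.96, 2.0, 3.0):
    p = two_sided_p(t)
    print(f"{t:4.2f} standard errors: P = {p:.4f}, about 1 in {1.0 / p:.0f}")
```

The arithmetic confirms the convenience of the limit; it says nothing about whether that limit is the best one.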

The distinction between problems of estimation and significance arises 
in biological applications, though I have naturally tended to speak 
mainly of physical ones. Suppose that a Mendelian finds in a breeding 
experiment 459 members of one type, 137 of the other. The expectations
on the basis of a 3:1 ratio would be 447 and 149. The difference
would be declared not significant by any test. But the attitude that 
refuses to attach any meaning to the statement that the simple rule 
is right must apparently say that if any predictions are to be made 
from the observations the best that can be done is to make them on 
the basis of the ratio 459/137, with allowance for the uncertainty of 
sampling. I say that the best is to use the 3/1 rule, considering no un- 
certainty beyond the sampling errors of the new experiments. In fact 
the latter is what a geneticist would do. The observed result would be 
recorded and might possibly be reconsidered at a later stage if there was 
some question of differences of viability after many more observations 
had accumulated; but meanwhile it would be regarded as confirmation 
of the theoretical value. This is a problem of what I call significance. 
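
The figures of the breeding experiment can be checked directly. The sketch below computes the expectations on the 3:1 rule, χ² on one degree of freedom with its tail area P, and the departure in units of its sampling standard error; binomial sampling is assumed.

```python
import math

observed = (459, 137)
total = sum(observed)
expected = (0.75 * total, 0.25 * total)      # 447 and 149 on the 3:1 rule

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# One degree of freedom: chi is a standardized normal deviate.
p_tail = math.erfc(math.sqrt(chi2 / 2.0))
# Standard error of the first count under the 3:1 rule (binomial sampling).
std_error = math.sqrt(total * 0.75 * 0.25)

print(f"chi^2 = {chi2:.3f}, P = {p_tail:.3f}")
print(f"departure = {observed[0] - expected[0]:.0f}, standard error = {std_error:.2f}")
```

The departure of 12 is only a little more than one standard error, so no test would call it significant, and predictions would be made from the 3:1 rule rather than from the ratio 459/137.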

† Statistical Methods, p. 46.

But what are called significance tests in agricultural experiments
seem to me to be very largely problems of pure estimation. When a 
set of varieties of a plant are tested for productiveness, or when various 
treatments are tested, it does not appear to me that the question of 
presence or absence of differences comes into consideration at all. It is 
already known that varieties habitually differ and that treatments have 
different effects, and the problem is to decide which is the best; that 
is, to put the various members, as far as possible, in their correct order. 
The design of the experiment is such that the order of magnitude of the 
uncertainty of the result can be predicted from similar experiments in 
the past, and especially from uniformity trials, and has been chosen so 
that any differences large enough to be interesting would be expected 
to be revealed on analysis. The experimenter has already a very good 
idea of how large a difference needs to be before it can be considered 
to be of practical importance; the design is made so that the uncertainty 
will not mask such differences. But then the P integral found from the 
difference between the mean yields of two varieties gives correctly the 
probability on the data that the estimates are in the wrong order, which 
is what is required. If the probability that they are misplaced is under 
0.05 we may fairly trust the decision. It is hardly correct in such a case
to say that previous information is not used; on the contrary, previous 
information relevant to the orders of magnitude to be compared has 
determined the whole design of the experiment. What is not used is 
previous information about the differences between the actual effects 
sought, usually for the very adequate reason that there is none; and 
about the error likely to arise in the particular experiment, which is 
only an order of magnitude and by the results found several times in 
this book can be treated as previous ignorance as soon as we have directly 
relevant information. If there are any genuine questions of significance 
in agricultural experiments it seems to me that they must concern only 
the higher interactions. 
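
In the agricultural case the tail area answers an estimation question: the probability, on the data, that two varieties have been placed in the wrong order. A minimal sketch follows, assuming the difference of means is normally distributed with a known standard error (with an estimated standard error the t rule of the text would replace the normal tail); the yields and standard error are hypothetical.

```python
import math

def prob_wrong_order(mean_a, mean_b, std_error_of_difference):
    """Tail area beyond zero for the difference of two estimated means."""
    t = abs(mean_a - mean_b) / std_error_of_difference
    return 0.5 * math.erfc(t / math.sqrt(2.0))

# Hypothetical mean yields of two varieties and a hypothetical standard error.
p = prob_wrong_order(mean_a=31.4, mean_b=29.0, std_error_of_difference=1.2)
print(f"probability that the order is wrong: {p:.4f}")
print("trust the order" if p < 0.05 else "order still in doubt")
```

Here only one tail is used, because the question is which variety is the better, not whether any difference exists.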

7.22. A further problem that arises in the use of any test that simply 
rejects a hypothesis without at the same time considering possible 
alternatives is that admirably stated by the Cheshire Cat in the quota- 
tion at the head of Chapter V. Is it of the slightest use to reject a 
hypothesis until we have some idea of what to put in its place? If 
there is no clearly stated alternative, and the null hypothesis is rejected, 
we are simply left without any rule at all, whereas the null hypothesis, 
though not satisfactory, may at any rate show some sort of
correspondence with the facts. It may for instance represent 90 per cent. of
the variation and to that extent may be of considerable use in
prediction, even though the remaining 10 per cent. may be larger than we
should expect if it was strictly true. Consider, for instance, the history 
of the law of gravitation. Newton first derived it from Kepler’s laws 
and a comparison of the accelerations of the moon and of a body falling 
freely at the earth’s surface. Extending it to take account of the mutual 
attractions of the planets and of the perturbations of the moon by 
the sun, he got the periods and orders of magnitude of the principal 
perturbations. But he did not explain the long inequality of Jupiter 
and Saturn, with a period of 880 years, which gives displacements in 
longitude of 1196″ and 2908″ of arc for the two planets,† and was only
explained by Laplace a century later. The theory of the moon has been
taken only in the present century, by E. W. Brown, to a stage where the
outstanding errors can be said to be within the errors of observation;
and even now the theory involves the empirical secular acceleration
of the mean motion, attributable to tidal friction, a periodic empirical
term with an amplitude of 10.7″ and a period of seventy years, and some
curious short-period fluctuations that are not satisfactorily explained.
In fact agreement with Newton’s law was not given by the data used
to establish it, because these data included the main inequalities of the 
moon; it was not given during his lifetime, because the data included 
the long inequality of Jupiter and Saturn; and when Einstein's modi- 
fication was adopted the agreement of observation with Newton’s law 
was 300 times as good as Newton ever knew. Even the latter appears 
at present as powerless as Newton’s to explain the long empirical term 
in the moon’s longitude and the secular motion of the node of Venus. 
There has not been a single date in the history of the law of gravitation 
when a modern significance test would not have rejected all laws and 
left us with no law. Nevertheless the law did lead to improvement for 
centuries, and it was only when an alternative was sufficiently precisely 
stated to make verifiable predictions that Newton’s law could be 
dropped, except of course in the cases where it is still a valid
approximation to Einstein’s, which happen to be most of the cases. The test
required, in fact, is not whether the null hypothesis is altogether satis- 
factory, but whether any suggested alternative is likely to give an im- 
provement in representing future data. If the null hypothesis is not 
altogether satisfactory we can still point to the apparent discrepancies 
as possibly needing further attention, and attention to their amount
gives an indication of the general magnitude of the errors likely to arise
if it is used; and that is the best we can do.

† I am indebted for the values to Mr. D. H. Sadler; they are from G. W. Hill,
Astronomical Papers of the American Ephemeris, vols. iv and vii.

7.23. The original use of χ² involves a further difficulty, which could
occur also in using Fisher’s z, which is the extension of χ² to take
account of the uncertainty of the standard error. If we have a set of
frequencies, n−m of which could be altered without producing an
inconsistency with the marginal totals of a contingency table, their
variations could be interpreted as due to n−m possible new functions
in a law of chance, which would then give χ² = 0; or they could be due
to a failure of independence, a tendency of observations to occur in
bunches increasing χ² systematically without there necessarily being
any departure from proportionality in the chances. We have seen the
importance of this in relation to the annual periodicity of earthquakes.
Similarly, when the data are measures they can be divided into groups
and means taken for the groups. The variation of the group means
can be compared with the variations in the groups to give a value of z.
But this would be increased either by a new function affecting the
measures or by a failure of independence of the errors, which need not
be expressible by a definite function. The simple use of χ² or of z
would not distinguish between these; each new function or a failure of
independence would give an increase, which might lead to the rejection
of the null hypothesis, but we shall still have nothing to put in its place
until we have tested the various alternatives. What is perhaps even
more serious is that with a large number of groups the random variation
of χ² on the null hypothesis is considerable, and a systematic variation
that would be detected at once if tested directly may pass as random
through being mixed up with the random error due simply to the
arbitrary method of grouping (cf. 2.76, p. 91). Fisher of course has attended
to this point very fully, though some of his enthusiastic admirers seem
to have still overlooked it. Both with χ² and z it is desirable to separate
the possible variation into parts when the magnitude of one gives little
or no information about what is to be expected of another, and to
test each part separately. The additive property of χ² makes it easily
adaptable for this purpose. Each component of variation makes its
separate contribution to χ², and $e^{-\frac{1}{2}\chi^2}$ separates into factors,
so that the contributions are mutually irrelevant. It is for this
reason that χ² and … have appeared explicitly in my tests where
several new parameters are associated. The χ² here is not the complete
χ², but the contribution for the possible component variations
directly under consideration. Whether the random variation is more
or less than its expectation (so long as it is random) is irrelevant to
the test. 
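
The way a real component can be lost among many random ones may be seen from the fact that χ² on ν degrees of freedom has expectation ν and variance 2ν. The sketch below measures the excess of χ² over its expectation in units of its standard deviation; the contribution of 9 and the 50 degrees of freedom are hypothetical numbers chosen for illustration.

```python
import math

def standardized_excess(chi2, dof):
    """Excess of chi-squared over its expectation, in units of its standard deviation."""
    return (chi2 - dof) / math.sqrt(2.0 * dof)

systematic = 9.0          # hypothetical contribution of one real component
groups = 50               # hypothetical number of degrees of freedom in all

# Tested directly: one component on one degree of freedom.
direct = standardized_excess(systematic, 1)
# Mixed in: the other 49 components each contribute about 1 on average.
mixed = standardized_excess(systematic + (groups - 1), groups)

print(f"tested directly: {direct:.2f} standard deviations above expectation")
print(f"mixed with {groups - 1} random components: {mixed:.2f}")
```

The same contribution that stands far out when tested alone is well within the ordinary random variation of the complete χ², which is the reason for separating the components and testing each one.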

7.3. The treatment of expectations is another peculiar feature of
Pearson’s work. The choice of a set of functions of the observations, 
and equating them to the expectations given the law under considera- 
tion, is often a convenient way of estimating the parameters. Pearson 
used it habitually in the method of moments and in other work. It is 
not necessarily the best method, but it is liable to be the easiest. But 
it is often very hard to follow in Pearson’s presentations and in those 
of some of his followers. It is indeed very difficult on occasion to say 
whether in a particular passage Pearson is speaking of a function of the 
observations or the expectation that it may be an estimate of. When he 
speaks of a ‘mean’ he sometimes intends the mean of the observations,
sometimes the expectation of one observation given the law, and the 
complications become greater for higher moments. The transition from 
the function of the observations to the corresponding expectation in- 
volves a change of data, which is passed over without mention even when 
the use of inverse probability may be recommended a few pages later. 
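
The distinction between a function of the observations and the expectation it estimates is easy to keep explicit in code. The following is a sketch of fitting by moments, using a gamma law (Pearson's Type III) purely as an illustration; the data are hypothetical and the variance is taken with divisor n.

```python
import statistics

def gamma_by_moments(observations):
    """Fit a gamma law by equating sample mean and variance to expectations."""
    # Functions of the observations:
    sample_mean = statistics.fmean(observations)
    sample_var = statistics.pvariance(observations)
    # Expectations given the law: mean = shape * scale, variance = shape * scale**2.
    scale = sample_var / sample_mean
    shape = sample_mean / scale
    return shape, scale

data = [2.0, 4.0, 4.0, 6.0]          # hypothetical observations
shape, scale = gamma_by_moments(data)
print(f"sample mean (function of the data): {statistics.fmean(data)}")
print(f"fitted shape = {shape}, scale = {scale}")
print(f"expectation under the fitted law  : {shape * scale}")
```

The sample mean and the fitted expectation coincide numerically by construction, but the first is conditioned on the observations and the second on the law, which is the change of data that the passage says is passed over.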

7.4. The general agreement between Professor R. A. Fisher and myself 
has been indicated already in many places. The apparent differences 
have been much exaggerated owing to a rather unfortunate discussion 
some years ago, which was full of misunderstandings on both sides. 
Fisher thought that a prior probability based on ignorance was meant 
to be a statement of a known frequency, whereas it was meant merely 
to be a formal way of stating that ignorance, and I had been insisting 
for several years that no probability is simply a frequency. I thought
that he was attacking the ‘Student’ rule, of which my result for the 
general least squares problem was an extension; at the time, to my 
regret, I had not read ‘Student’s’ papers and it was not till considerably 
later that I saw the intimate relation between his methods and mine. 
This discussion no longer, in my opinion, needs any attention. My main 
disagreement with Fisher concerns the hypothetical infinite population, 
which is a superfluous postulate since it does not avoid the need to 
estimate the chance in some other way, and the properties of chance 
have still to be assumed since there is no way of proving them. Another 
is that, as in the fiducial argument, an inadequate notation enables him, 
like ‘Student’, to pass over a number of really difficult steps without 
stating what hypotheses are involved in them. The third is the use of
the P integral, but Fisher’s alertness for possible dangers is so great
that he has anticipated all the chief ones. I have in fact been struck
repeatedly in my own work, after being led on general principles to a 
solution of a problem, to find that Fisher had already grasped the 
essentials by some brilliant piece of common sense, and that his results 
would be either identical with mine or would differ only in cases where 
we should both be very doubtful. As a matter of fact I have applied my
significance tests to numerous applications that have also been worked 
out by Fisher’s, and have not yet found a disagreement in the actual 
decisions reached. The advantage of my treatment, I should say, is 
that it shows the relation of these methods among themselves, and to 
general principles concerning the possibility of inference, whereas in 
Fisher’s they apparently involve independent postulates. In relation 
to some special points, my methods would say rather more for Fisher’s
than he has himself claimed. Thus he claims for maximum likelihood
only that it gives a systematic error of order less than $n^{-1/2}$ in the
ordinary cases where the standard error is itself of order $n^{-1/2}$. Inverse
probability makes the systematic error of order $n^{-1}$. He shows also by
a limiting argument that statistics given by the likelihood lead to esti- 
mates of the population parameters at least as accurate as those given 
by any other statistics, when the number of observations is large. In- 
verse probability gives the result immediately without restriction on the 
number of observations. The fiducial argument really involves
hypotheses equivalent to the use of inverse probability, but the introduction
of maximum likelihood appears in most cases to be an independent
postulate in Fisher’s treatment. In mine it is a simple consequence of 
general principles. The trouble about taking maximum likelihood as a 
primitive postulate, however, is that it would make significance tests 
impossible, just as the uniform prior probability would. The maximum 
likelihood solution would always be accepted and therefore the simple 
law rejected. In actual application, however, Fisher uses a significance 
test based on P and avoids the need to reject the simple law whether 
it is true or not; thus he gets common-sense results though at the cost 
of some sacrifice of consistency. The point may be illustrated by a 
remark of W. G. Emmett† to the effect that if an estimated difference 
t is less than the adopted limit, it affords no ground for supposing the 
true difference to be 0 rather than 2t. If we adopted maximum likelihood 
or the uniform prior probability in general there would be no escape 
from Emmett’s conclusion; but no practical statistician would accept 
it. Any significance test whatever involves the recognition that there is 
† B. J. Psych. 26, 1936, 362-87. 




something special about the value 0, implying that the simple law may 
possibly be true; and this contradicts the principle that the maximum 
likelihood estimate, or any unbiased estimate, is always the best. 
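Emmett's remark can be verified on the likelihood alone. As a sketch, assume (an illustrative assumption, not made explicitly in the text) that the estimate $t$ is normally distributed about the true difference $\alpha$ with known standard error $s$. The likelihood is then

$$L(\alpha) \propto \exp\left\{-\frac{(t-\alpha)^2}{2s^2}\right\}, \qquad L(0) = L(2t) = L(t)\,\exp\left(-\frac{t^2}{2s^2}\right),$$

so the values 0 and $2t$ are equally well supported by the observation, and with a uniform prior probability their posterior densities are equal as well. A preference for 0 can arise only if a finite prior probability is concentrated at $\alpha = 0$, which is what any significance test presupposes.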

Fisher has already introduced the useful word ‘fiducial’ for limits, in 
estimation problems, such that there may be on the data a specified 
probability that the true value lies between them. But it seems to be 
supposed that ‘fiducial’ and ‘significant’ mean the same thing, which 
is not the case. 

He has often argued for making a decision rest on the observations 
immediately under consideration and not on any previous evidence. 
This appears to contradict the view that I have developed, that the 
best inference must rest on the whole of the relevant evidence if we are 
to be consistent. The difference is not so great as it appears at first 
sight, however. I find that vaguely recorded evidence is just as well 
ignored, and precisely recorded evidence may require a significance 
test to establish its relevance. He also avoids the tendency of the 
human mind to remember what it wants to believe and forget the rest, 
unless it is written down at the time. With such exceptions as these, 
with respect to which we should concur, Fisher seems to be as willing 
in practice to combine data as I am. In fact, in spite of his occasional 
denunciations of inverse probability I think that he has succeeded 
better in making use of what it really says than many of its professed 
users have. 

7.5. E. S. Pearson and J. Neyman have given an extended analysis of 
significance tests. In any test, if we are to have a rule for detecting the 
falsehood of a law, we must expect to make a certain number of mistakes 
owing to occasional large random errors. If we habitually use a 5 per 
cent. P limit, the null hypothesis will in the ordinary course of events 
be rejected in about 5 per cent. of the cases where it is true. As it will 
often be false, if we choose such a limit the number of such mistakes 
will be less than 5 per cent. of the whole number of cases. It is in this 
sense that Fisher speaks of ‘exact tests of significance’. Pearson and 
Neyman, however, go further. This type of mistake is called an error 
of the first kind. But it is also possible that a new parameter may be 
required and that, owing either to its smallness or to the random error 
having the opposite sign, the estimate is within the range of acceptance 
of the null hypothesis; this they call an error of the second kind, that 
of accepting the null hypothesis when it is false. They have given 
extensive discussions of the chances of such errors of the second kind, 





tabulating their risks for different possible values of the new parameter.† 
I do not think that they have stated the question correctly, however, 
though this attention to errors of the second kind bears some resem- 
blance to the principle that I have used here, that there is no point in 
rejecting the null hypothesis until there is something to put in its place. 
Their method gives a statement of the alternative. But in a practical 
case the alternative will either involve an adjustable parameter or will 
be as definitely stated as the null hypothesis. For instance, the laws 
of gravitation and light of Newton and Einstein involve the same 
number of adjustable parameters, the constant of gravitation and the 
velocity of light appearing in both. Now Pearson and Neyman proceed 
by working out the above risks for different values of the new para- 
meter, and call the result the power function of the test, the test itself 
being in terms of the P integral. But if the actual value is unknown 
the value of the power function is also unknown; the total risk of errors 
of the second kind must be compounded of the power functions over the 
possible values, with regard to their risk of occurrence. On the other 
hand, if the alternative value is precisely stated I doubt whether any- 
body would use the P integral at all; if we must choose between two 
definitely stated alternatives we should naturally take the one that gives 
the larger likelihood, even though each may be within the range of accep- 
tance of the other. To lay down an order of test in terms of the integral 
in such a case would be very liable to lead to accepting the first value 
suggested even though the second may agree better with the observations. 

It may, however, be interesting to see what would happen if the new 
parameter is needed as often as not, and if the values when it is needed 
are uniformly distributed over the possible range. Then the frequencies 
in the world would be proportional to my assessment of the prior 
probability. Suppose, then, that the problem is, not knowing in any 
particular case whether the parameter is 0 or not, to identify the cases 
so as to have a minimum total number of mistakes of both kinds. 
Using the notation of 5.0, the chance of q being true and of a being in 
a range da is $P(q\,da \mid H)$. That of q', with α in a range dα, and of a 
being in the range da, is $P(q'\,d\alpha\,da \mid H)$. If, then, we assign an $a_c$ and 
assert q when $|a| < a_c$, q' when $|a| > a_c$, and sampling is random, 
the expectation of the total fraction of mistakes will be 

$$2\int_{a_c}^{\infty} P(q\,da \mid H) + 2\int_{0}^{a_c}\!\!\int P(q'\,d\alpha\,da \mid H), \qquad (1)$$

the second integral being over the range of α. Thus the second integral 
is $2\int_{0}^{a_c} P(q'\,da \mid H)$. Now if $a_c$ is chosen to make the total a minimum, 
we must have for small variations about $a_c$ 

$$P(q\,da \mid H) = P(q'\,da \mid H). \qquad (2)$$

But these are respectively equal to 

$$P(da \mid H)\,P(q \mid a_c, H) \quad \text{and} \quad P(da \mid H)\,P(q' \mid a_c, H);$$

whence 

$$P(q \mid a_c, H) = P(q' \mid a_c, H). \qquad (3)$$

But this is the relation that defines the critical value. Hence, with 
world-frequencies in proportion to the prior probability used to express 
ignorance, the total number of mistakes will be made a minimum if the 
line is drawn at the critical value that makes K = 1. 

† Univ. Coll. Lond., Stat. Res. Mems. 2, 1938, 26-67, and earlier papers. 

Now I do not say that this proportionality holds; all that I should 
say myself is that at the outset we should expect to make a minimum 
number of mistakes in this way, but that accumulation of information 
may lead to a revision of the prior probabilities for further use and the 
critical value may be correspondingly somewhat altered. But 
whatever the frequency law may be, we notice that it is the values of α near 
a, and therefore, in the cases needing discussion, the small values, that 
contribute most of the second term in (1). Revision would therefore 
alter (3) in the ratio of the numbers of the cases of α = 0 and of small 
values of α, and therefore K would be altered by a factor independent 
of the number of observations. We should therefore get the best result, 
with any distribution of α, by some form that makes the ratio of the 
critical value to the standard error increase with n. It appears then that 
whatever the distribution may be, the use of a fixed P limit cannot be 
the one that will make the smallest number of mistakes. The absolute 
best is of course unknown since we do not know the distribution in 
question except so far as we can infer it from similar cases. 
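The argument of this section can be illustrated numerically. The sketch below is not from the text: it assumes a normally distributed estimate a with standard error s = SIGMA/sqrt(n), a new parameter that is needed as often as not, and, when needed, values of α uniform on (-L, L). The names `dens_q`, `dens_q1`, `critical_value` and `mistake_rate` and the values of SIGMA and L are hypothetical choices made for the illustration. It locates the a_c at which K = 1, checks by a direct scan that this a_c minimizes the total fraction of mistakes in (1), and records the ratio a_c/s for several n.

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def dens_q(a, s):
    """Density of the estimate a when the parameter is 0 (hypothesis q)."""
    return phi(a / s) / s

def dens_q1(a, s, L):
    """Density of a under q', with alpha uniform on (-L, L) (an assumption)."""
    return (Phi((a + L) / s) - Phi((a - L) / s)) / (2.0 * L)

def critical_value(s, L):
    """Bisection for the a_c > 0 at which K = dens_q / dens_q1 equals 1."""
    lo, hi = 0.0, L
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if dens_q(mid, s) > dens_q1(mid, s, L):
            lo = mid   # K > 1 here, so the crossing lies further out
        else:
            hi = mid
    return 0.5 * (lo + hi)

def mistake_rate(c, s, L, steps=400):
    """Expected fraction of wrong decisions when q is asserted for |a| < c."""
    first_kind = 2.0 * (1.0 - Phi(c / s))          # q true but |a| > c
    h = c / steps
    # q' true but |a| < c: midpoint rule on (0, c), doubled by symmetry
    second_kind = 2.0 * h * sum(dens_q1((i + 0.5) * h, s, L) for i in range(steps))
    return 0.5 * first_kind + 0.5 * second_kind    # q and q' equally frequent

SIGMA, L = 1.0, 10.0   # illustrative values only
results = []
for n in (10, 100, 1000):
    s = SIGMA / math.sqrt(n)
    a_c = critical_value(s, L)
    # scan thresholds from 0.5 a_c to 1.5 a_c in steps of 1 per cent of a_c
    grid = [a_c * (0.5 + 0.01 * j) for j in range(101)]
    best = min(grid, key=lambda c: mistake_rate(c, s, L))
    results.append((n, a_c / s, best / s))
    print(n, round(a_c / s, 3), round(best / s, 3))
```

Under these assumptions the threshold found by the scan agrees with the K = 1 value, and the ratio of the critical value to the standard error grows with n, whereas a fixed P limit would keep that ratio constant.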

7.51. This procedure has some interest in relation to ‘giving a theory 
every chance’. There are cases where there is no positive evidence for 
a new parameter, but important consequences might follow if it was not 
zero, and we must remember that K > 1 does not prove that it is zero, 
but merely that it is more likely to be zero than not. Then it is worth 
while to examine the alternative q’ further and see what limits can be 
set to the new parameter, and thence to the consequences of introducing 
it. This occurred in the discussion of the viscosity of the earth. The 
new parameter here would be the rate of distortion per unit stress when 
the stress is maintained indefinitely long; if it is zero the viscosity is 




infinite and the strength is finite. There was no positive evidence that 
the parameter is not zero, but if it was the way might be open to large 
distortions under forces acting for a long enough time. It was therefore 
desirable to consider what limits could be assigned to the new para- 
meter from evidence actually available, and to see whether they would 
permit the amounts of distortion that were claimed. Here the use of 
deduction as an approximation would not permit the discussion of q' at 
all, but on recognizing that it is only an approximation we are free to 
continue to consider q' and fix limits to its consequences. It was actually 
found† that the largest admissible value of the new parameter, that is, 
the smallest possible viscosity, led to insufficient distortion under any 
force suggested. This is a case where a hypothesis, that of ultimate 
indefinitely large distortion, is disposed of not only by the lack of posi- 
tive evidence for the new parameter needed to make it possible at all, 
but also by the fact that even on choosing the new parameter to be as 
favourable as possible to it, consistently with other evidence, the result 
is still contradicted. 

7.6. The analysis of this chapter is relevant to the standard presenta- 
tions of statistical mechanics, those of Boltzmann and Gibbs. The 
original derivation of the distribution of velocities, that of Maxwell, 
proceeded by supposing, first, that the probability of a given resultant 
velocity is a function of that velocity alone; secondly, that those for the 
three components separately are independent. From these hypotheses 
Maxwell's law follows. Boltzmann attempted to go more into detail by 
considering the probable effects of collisions, and appeared to show 
that a function H, representing the departure from a Maxwellian state, 
would diminish. An objection to Maxwell’s treatment was that he 
assumed independence of the components. But he claimed only to 
consider the steady state, where this might possibly hold. Boltzmann, 
however, considered departures from the steady state, and assumed 
irrelevance between the positions and velocities of neighbouring mole- 
cules. This is plainly illegitimate if the density is not uniform or if the 
velocity varies systematically between regions. The presence of one 
molecule in a region affords ground for supposing that the region is one 
of high density and therefore gives an excess probability that there will 
be another near to it. A velocity of a molecule implies an excess proba- 
bility that a neighbour has one in a similar direction; in each case sup- 
posing that any original departures from homogeneity have not had 




† Jeffreys, The Earth, 1929, pp. 304-5. 


time to be smoothed out. Thus Boltzmann’s treatment is definitely 
worse than Maxwell’s, in spite of its greater complexity. Maxwell 
applied the hypothesis of independence only to the case where it might 
be true; Boltzmann applied it to cases where it quite certainly contra- 
dicts the premisses. His argument affords no ground whatever for sup- 
posing that a system will approach a Maxwellian state, because it is 
only when the final state has been reached that the hypotheses can pos- 
sibly be right. This criticism of the Boltzmann method would be appre- 
ciated by any statistician that understands a correlation coefficient. 
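The step from Maxwell's two hypotheses to his law can be sketched as follows; the symbols $f$, $\phi$, $\psi$ and $h$ are introduced here for illustration and are not the book's notation. Let $f$ be the probability density of one velocity component, and let the joint density depend only on the resultant speed. Independence of the components then requires

$$f(u)\,f(v)\,f(w) = \phi(u^2+v^2+w^2).$$

Putting $v = w = 0$ gives $f(u) = \phi(u^2)/f(0)^2$, and with $\psi(x) = \phi(x)/f(0)^3$ the condition becomes

$$\psi(x)\,\psi(y)\,\psi(z) = \psi(x+y+z), \qquad \psi(0) = 1,$$

whose continuous solutions are of the form $\psi(x) = e^{-hx}$ for some constant $h$. Hence

$$f(u) = \sqrt{h/\pi}\; e^{-hu^2},$$

with $h > 0$ so that the density can be normalized. The argument uses independence of the components throughout, which is why it can apply at most to the steady state.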

In the treatment of Gibbs no attempt is made to treat the individual 
system; instead, an ensemble of an infinite number is set up and con- 
clusions are drawn as averages over the ensemble. But there is no 
guarantee at all that an average has any relevance to a single system. 
It might, for instance, be merely the mean of two peaks and itself 
correspond to no individual case at all. What is done is to consider the 
state of a system by regarding the n coordinates and n momenta as 
plotted in space of 2n dimensions. Then the values at any instant 
determine the rates of change, by the equations of dynamics, and we 
can consider how the volume of a small region (corresponding to a range 
of different systems) will vary if each point in it moves at the rate so 
specified. Liouville’s theorem shows that it will not vary. By some 
process that is recognized as obscure this is made to lead to the 
conclusion that the density in this phase space is uniform. Thus Jeans† 
appeals to experiment to say that if a property is found to hold in 
general for systems that have been left to themselves for a long time, 
that must mean either that the representative points crowd into the 
regions where that property holds, which is forbidden by Liouville’s 
theorem; or that the property is true for the whole of the space, and 
therefore, apparently, the distribution of density does not matter and 
may as well be taken uniform. But there is no theoretical reason to 
show that there should be any such properties. Fowler‡ gives a similar 
argument, including the statement ‘that such a W really exists is largely 
a pious hope’. What can be done by these methods is at the most to 
obtain relations between properties, assuming that such relations exist; 
they give no explanation of why they should exist. This can be done 
only by considering the individual system and showing that certain 
properties would be expected to hold for any individual system. Any 
sort of averaging is definitely dangerous. 

The fundamental fact appears to be that we do not in general know 

† Dynamical Theory of Gases, 1921, p. 73. ‡ Statistical Mechanics, 1929, p. 12. 




the initial state of the system sufficiently accurately to predict even one 
collision. Though the equations of classical mechanics would ordinarily 
lead to a unique solution if the initial state was known exactly, and we 
had enough time for the computation, a trifling uncertainty in the 
velocity of one molecule would affect the identity of the first struck by 
it, and this would lead to differences afterwards that would ultimately 
affect the entire system. It is this uncertainty that requires the intro- 
duction of probability at all. For a system with exactly known initial 
conditions there would be a unique trajectory in phase space (classical 
mechanics of course being assumed). But for the actual system we have 
a set of possible trajectories with different probabilities forming a con- 
tinuous set. On account of the collisions, even if these differ only 
slightly originally, they will quickly become widely scattered. The 
essential point is not so much that the volume of an element in the 
phase space remains the same as that its shape is distorted continuously 
between every pair of collisions, and it is broken up and displaced bodily 
at every collision. The result is that if we fix attention on a given 
element of the phase space, the chance that the system will be within 
it after a long time is made up of components from the probabilities of 
all the possible initial states. The tendency of this averaging is to make 
the probability density after a long time uniform, subject to the condi- 
tion that the only admissible states are those with the same invariant 
properties as the original state — such as energy, for all conservative 
systems, and linear and angular momentum, for free systems. The 
density in phase space thus acquires a definite meaning as a true 
probability, arising ultimately from the fact that we do not know the initial 
state accurately. It leads to inferences about, for instance, the proba- 
bility that there will be a given fraction of the momenta in one direction 
between stated limits, and hence to definite predictions about statistical 
properties such as pressure and density for every individual system. 
Thus the theory does give what is wanted, a prediction about the 
ultimate state of the individual system and made with practical certainty.† 
It is in no other sense that the relations found can be considered as 
physical laws or the quantities in them as physical magnitudes. 

The general principles of this kind of averaging are known as ergodic 
theory and have been extensively studied, especially by French and 
Russian authors.‡ 

† Proc. Roy. Soc. A, 160, 1937, 337-47. 

‡ Cf. M. Fréchet, Borel’s Traité du calcul des probabilités, t. 1, fasc. 3, 1938; H. and 
B. S. Jeffreys, Methods of Mathematical Physics, 1946, 148-62. 



VIII 


GENERAL QUESTIONS 

‘But you see, I can believe a thing without understanding it. It’s all a matter 
of training.’ Dorothy L. Sayers, Have His Carcase. 

8.0. Most of the present books on statistics, and of the longer papers 
in journals, include a careful disclaimer that the authors propose to 
use inverse probability, and emphasize its lack of logical foundation, 
which is supposed to have been repeatedly pointed out. In fact the 
continued mention of a principle that everybody is completely 
convinced is nonsense recalls the saying of the Queen in Hamlet: ‘The lady 
doth protest too much, methinks.’ Unfortunately some people that 
have examined the question have not been so convinced, and they 
include such first-rate logicians as W. E. Johnson, C. D. Broad, and 
F. P. Ramsey. The objectors, however, mostly seem to understand by 
the principle something so nonsensical that it hardly seems worth 
attention, namely that the prior probability is intended to be a known 
frequency. This statement has been repeated by Kendall† since the 
first edition of this book. The essence of the present theory is that no 
probability, direct, prior, or posterior, is simply a frequency. The 
fundamental idea is that of a reasonable degree of belief, which satisfies certain 
rules of consistency and can in consequence of these rules be formally 
expressed by numbers by means of the addition rule, which in itself is 
a convention. In many cases the numerical assessment is the same as 
that of a corresponding frequency, but that does not say that the 
probability and the frequency are the same thing even in those cases. The 
fact that physicists describe an atmospheric pressure as 759 millimetres 
does not make a pressure into a length (and meteorologists now give 
the pressure in terms of the millibar, which really is a unit of pressure). 
A number of choices of units so that certain constants of proportionality 
would have measure unity, and then the identification of the constants 
with the number unity, led to the amazing conclusion that the ratio 
of the electrostatic and electromagnetic units of charge, which are 
quantities of the same kind, is the velocity of light ; and instead of seeing 
that this was a reductio ad absurdum several generations of physicists 
tried to justify it. There are signs now that the fact is appreciated. 
The equations of heat conduction and diffusion have the same form, but 
that does not make heat a vapour. The notion of a reasonable degree 

† The Advanced Theory of Statistics, 1, 178. 




of belief must be brought in before we can speak of a probability; and 
even those writers that do not mention it at the beginning have to use 
it at the end before any application can be made of the results — or else 
avoid the question by allowing the person advised to supply it himself, 
which he does in practice without the slightest difficulty. Even if the 
prior probability is based on a known frequency, as it is in some cases, 
reasonable degree of belief is needed before any use can be made of it. 
It is not identical with the frequency. 

The kind of case where a prior probability may be based on a known 
frequency is the following. Suppose (a) we deliberately make up 10,001 
classes of 10,000 balls each, such that one contains 10,000 white ones, 
the next 9,999 white and 1 black, and so on. We select one of these at 
random and extract a sample of 30, 20 of which are found to be white 
and 10 black. By the condition of randomness the chance of selecting 
any class for sampling is the same, and the prior probability for its 
composition follows Laplace’s rule. We infer that in the class sampled 
about 2/3 are probably white and the rest black, the probabilities for 
other ratios being distributed according to a definite rule. But suppose 
(b) that classes of 10,000 were chosen at random from a class of number 
$10^{10}$, about the composition of which we had no previous information, 
and that we again sampled one of them and found 20 white and 10 black 
balls. Again the prior probability follows Laplace’s rule, but for a 
different reason. The posterior probabilities for the class sampled are 
the same in both cases. Case (b) is the one that usually concerns us, 
but the analysis is quite capable of dealing with (a), in which the prior 
probability is based on a known frequency. It may be pointed out that 
if we take a sample from a second class there will be a considerable 
difference in the results in the two cases. For in (a) the probability 
that the composition will have any particular value is almost what it 
was before; the only difference is that since one class, whose ratio was 
probably near 2:1, has been excluded, the probability that the second 
class will yield a sample with a composition in this neighbourhood is 
a shade less than it was before. But in case (b) the first sample is 
effectively a sample from the whole $10^{10}$, and its composition therefore 
implies a high probability that the 2 : 1 ratio holds approximately in 
this, and therefore in the next 10,000, which are another sample from 
it. Thus in case (b) the composition of the first sample gives a 
considerable increase in the probability that the second will show a ratio near 
2:1; in case (a) it slightly diminishes it. 

Case (b) is more like what we actually meet; (a) is highly artificial. 





But the fact that the inference from the first sample about the particular 
class sampled would be the same in both cases has been found surprising 
by some writers, and it seems worth while to point out that the infer- 
ences drawn about another class or a sample from one would be very 
different. In both cases the notion of reasonable degree of belief is 
involved through the notion of randomness. 
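The inference about the class sampled can be reproduced by direct enumeration. The following is a minimal sketch for case (a), assuming sampling without replacement and equal prior weight on every possible number R of white balls in the class; the variable names are hypothetical. It computes the posterior mean of the white fraction and the probability that the next ball drawn from the same class is white.

```python
from fractions import Fraction
from math import comb

N = 10_000        # balls in the class sampled
n, k = 30, 20     # sample size and number of white balls found

# Laplace's rule: every number R of white balls, 0..N, has equal prior weight,
# so the posterior weight of R is proportional to the hypergeometric likelihood.
weights = [comb(R, k) * comb(N - R, n - k) for R in range(N + 1)]
total = sum(weights)

# Posterior mean of the white fraction in the class sampled.
mean_fraction = Fraction(sum(R * w for R, w in enumerate(weights)), N * total)

# Probability that the next ball drawn from the same class is white:
# (R - k) white balls remain among the (N - n) not yet drawn.
next_white = Fraction(
    sum((R - k) * w for R, w in enumerate(weights)), (N - n) * total
)

print(float(mean_fraction))
print(next_white)
```

The second quantity reduces to (k + 1)/(n + 2) = 21/32, close to the ratio 2/3 quoted in the text. The calculation concerns only the class actually sampled, for which the text states that cases (a) and (b) agree; it does not address the second class, where the two cases differ.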

It is often said that some frequency definition is implicit in the work 
of Bernoulli, and even of Bayes and Laplace. This seems out of the 
question. Bayes constructed the elaborate argument in terms of 
expectation of benefit to derive the product rule, which he could have 
written down in one line by elementary algebra if he was using the De 
Moivre definition. The limit definition was not stated till eighty years 
later, by Leslie Ellis† and Cournot,‡ and there is no mention of a limit 
in this part of Bayes’s paper. Did Bayes go to this trouble to prove 
what was already obvious? Again, what can be the point of Laplace's 
‘equally possible’ on any frequency definition? He does not mention 
a limit, which first appeared in the literature after his writings also. 
Surely Laplace’s statement is meant to specify what cases he proposed 
to discuss; ‘equally possible’ is not meant to be true of all possible cases, 
otherwise why mention it ? And if it is not always true the De Moivre 
definition is rejected. In his application to sampling Laplace does take 
the possible numbers in the population as equally possible; but this 
does not say that he was supposing a world population of classes with 
the proportions known to be uniformly distributed. I suggest indeed 
that the author of the Mécanique Céleste was much too great a man to 
have thought anything so ridiculous. His own statement, in the 
Introduction, is ‘La théorie des probabilités n’est que le bon sens 
réduit au calcul’. His problem was simply, using the sample, to find 
out from it what he could about a population of otherwise unknown 
composition; and he said that the composition was otherwise unknown 
by taking the alternatives equally possible, or, as we should now say, 
equally probable. Similarly, Bayes gave an explicit warning again and 
again that the uniform assessment is to be used only when there is no 
information whatever about the composition of the population sampled. 
With such care about this point it seems remarkable that he should have 
omitted to say that the population was drawn from a super-population 
of known composition if he meant it. Such a hypothesis must be re- 
jected on the internal evidence in Bayes’s paper by any significance 

† Camb. Phil. Trans. 8, 1843, 1-6. 

‡ Exposition de la théorie des chances et des probabilités, Paris, 1843. 




test. Similarly, it has been supposed that a limit definition is implicit in 
Bernoulli’s theorem. But, even if the value of the limit was taken for 
granted, the ratio in a finite sample, however large, could mathematically 
still be anything from 0 to 1; the theorem would be mathematically 
meaningless. The ratio in a finite sample, again, has been taken as the 
definition of the probability, and it has been suggested that Bernoulli 
himself intended this to be done. Then did he construct a long and 
difficult mathematical argument,† showing that this ratio would be 
near the probability in the conditions considered if he was going to 
take it as a definition at the end? And why did he call his book Ars 
Conjectandi? I maintain that the work of the pioneers shows quite 
clearly that they were concerned with the construction of a consistent 
theory of reasonable degrees of belief, and in the cases of Bayes 
and Laplace with the foundations of common sense or inductive 
inference. 

In a fairly extensive search I have not succeeded in tracing the origin 
of the belief that the prior probability is supposed to be derived from 
a known frequency. So far as I have found, Karl Pearson is the only 
person to have both believed anything like it and advocated the use of 
inverse probability. In several places he appeals to previous instances 
to justify the uniform assessment, which is consistent with the prior 
probability being, not a known frequency, but a degree of confidence 
based inductively on a previously observed frequency. This is entirely 
valid in terms of the present theory, and does not require a frequency 
definition. But also he sometimes says that without such previous 
instances the uniform assessment cannot be used, nor can any other. 
This, however, would make it impossible for the theory ever to find its 
first application. In this respect Pearson's statement is unsatisfactory, 
though I do not believe that even in its actual form it identifies an 
inferred frequency with a known one. It is, however, very difficult to 
understand Pearson on the point, because the development of the nature 
of scientific inquiry in the Grammar of Science often appears to be 
inconsistent with his statements in statistical papers, and in spite of his 
great achievements in introducing clarity in the Grammar he himself 
does not appear to have been influenced by them so much as might have 
been expected. With the doubtful exception of Pearson, however, the 
identification of the prior probability with a known frequency, or the 
statement that it must rest on one, is, so far as I have been able to 

† He did not use Stirling’s theorem, and his argument is much more difficult than 
would now be used. 




trace, to be found only in the writings of opponents. I hope that this 
clears me from the heinous charge of originality. 

8.1. The few critics of my treatment that have not proceeded by 
attributing to me views that I have explicitly rejected usually say that the 
prior probability is ‘subjective’ or ‘mystical’ and therefore meaningless,† 
or refer to the vagueness of previous knowledge as an indication that 
the prior probability cannot be uniquely assessed. On the former point, 
I should query whether any meaning can be attached to ‘objective’ 
without a previous analysis of the process of finding out what is objective. 
If it is done from experience it must begin with sensations, which are 
peculiar to the individual, and must give an account of how it is possible 
to proceed from the scattered sensations of an individual, including the 
reports of their sensations made to him by other individuals, to some 
set of statements that can form a possible basis of agreement for many. 
We must and do begin with the individual, and we never get rid of 
him, because every new ‘objective’ statement must be made by some 
individual and appreciated by other individuals. On the other hand, if 
we do not find out by experience what is objective we can do it only 
by imagination. One hesitates to say that critics believe that nothing 
but imagination is objective. 

What the present theory does is to resolve the problem by making 
a sharp distinction between general principles, which are as impersonal 
as those of deductive logic, and are deliberately designed to say by 
themselves nothing whatever about what experience is possible, and, 
on the other hand, propositions that do concern experience and are in 
the first place always merely considered among possible alternatives. 
The latter are possible scientific laws; the former give rules for deciding 
between them by means of experience and for drawing further inferences 
from them. The empirical proposition is always in the first place the 
result of imagination. It becomes a law or an objective statement when 
the general rules have compared it with experience and attached a 
high probability to it as a result of that comparison. That is the only 

† The meaning of ‘metaphysics’ and ‘mysticism’ seems to change with time. Compare 
the following, from J. L. Lagrange, 1760. I am indebted to Dr. F. Smithies for the 
reference: 

‘For the rest, I do not deny that it is possible, by the consideration of limiting processes 
from a particular point of view, to prove rigorously the principles of the differential 
calculus, but the kind of metaphysics which it is necessary to use in doing so is, if not 
contrary, at least foreign to the spirit of analysis. 

‘In methods which use the infinitely little, the calculation corrects the false hypotheses 
automatically. . . . The error is destroyed by a second error. . . . On the other hand, 
Newton’s method is completely rigorous.’ 




scientifically useful meaning of ‘objectivity’. If statements about 
possible results of experience were included in the general principles they 
would lead to illegitimate a priori assertions about experience, and these 
might easily be wrong and could be disposed of, as for the first frequency 
definition, only by introducing contradictions. 

It is argued that because P(p | q) depends on both p and q it cannot 
be an objective statement, since different persons with different 
knowledge would assess different probabilities of p. This is a confusion. p has 
no probability whatever of itself, any more than x + y has any particular 
value for given x if we do not know y. The probability of a proposition 
irrespective of the data has no meaning and is simply an unattainable 
ideal. On the other hand, two people both following the rules would 
arrive at the same value of P(p | q). It is a fact that the probabilities 
of a proposition with respect to different data will in general differ, 
and people with different data will make different assessments. But 
this is no contradiction, but merely the recognition of an obvious fact. 
They will arrive at consistent assessments if they tell each other their 
data and follow the rules. We can know no absolute best — that would 
require us to have all possible knowledge. But we can give a unique 
and practically applicable meaning to ‘the best so far as we can tell 
on our existing data’, and that is what the theory does. 
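
The point that P(p | q) is fixed once both p and q are given can be illustrated by a small sketch. It assumes, purely for illustration, a repeated trial with an unknown chance, a uniform prior for that chance, and two observers holding different records; the counts, the names and the choice of the posterior mean as the summary are all hypothetical and are not taken from the text.

```python
from fractions import Fraction


def posterior_mean(successes, trials):
    """Posterior mean of an unknown chance under a uniform prior
    (Laplace's rule of succession)."""
    return Fraction(successes + 1, trials + 2)


# Hypothetical records held by two observers who follow the same rules.
data_a = (7, 10)  # observer A: 7 successes in 10 trials
data_b = (2, 10)  # observer B: 2 successes in 10 trials

# Different data give different assessments; this is not a contradiction.
mean_a = posterior_mean(*data_a)
mean_b = posterior_mean(*data_b)

# Once they tell each other their data, both condition on the pooled record
# (assuming the two sets of trials are independent trials of one process).
pooled = (data_a[0] + data_b[0], data_a[1] + data_b[1])
mean_a_pooled = posterior_mean(*pooled)
mean_b_pooled = posterior_mean(*pooled)

print(mean_a, mean_b, mean_a_pooled, mean_b_pooled)
```

The assessments differ only while the data differ; with the same data and the same rules the two observers cannot disagree.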

One difficulty that has possibly led to more trouble than has received 
explicit mention is the treatment of vague and half-forgotten empirical 
information. This seems to be understood in such expressions as 
‘uncertainty of the previous knowledge’. We have several times been led 
to discuss such information, and the result has always been the same: 
information inadequately recorded can be treated only as a suggestion 
of possible alternatives, and the prior probability used to express 
previous ignorance should still be used. The fault is not in the theory 
but in an imperfection of the human mind that the theory makes it 
possible to correct. The difference between the results of different 
assessments of the prior probability in the same problem is much less 
than the differences between those found by different statisticians that 
agree about little except that the prior probability must be rejected. 

A prior probability used to express ignorance is merely the formal 
statement of that ignorance. It says ‘I do not know’ and leaves the 
posterior probability, if the observations are of any use for the purpose, 
to say ‘You know now’. The statements ‘I do not know x’ and ‘I do 
not know the probability of x’ still continue to be confused. The 
latter is ‘I do not know whether I have any information about x or not’, 
which differs from the former as much as $x^2$ differs from $x^4$, one having 
been derived from x by one operation of squaring and the other by two. 
I should gravely doubt whether anybody approaching a set of data in 
the latter state of mind could possibly do anything useful with them. 
To speak of ‘an unknown prior probability’ involves either this con- 
fusion or the identification of the prior probability with a world- 
frequency, and no coherent theory can be made until we are rid of both. 

The confusion may arise partly from the fact that probability state- 
ments are sentences in the indicative mood. Thus the question ‘Is 
Mr. Smith at home?’ can be expressed by three sentences in the indi- 
cative mood: 

I do not know whether Mr. Smith is at home. 

I want to know whether Mr. Smith is at home. 

I believe that you know whether Mr. Smith is at home. 

These three sentences contain the whole content of the question, and 
the difference from ‘Mr. Smith is at home’ is expressed by a transposi- 
tion of subject and verb and, in print, a symbol called a question- 
mark. The situation implied in these three statements is so common 
that a special symbolism has been introduced into language to express 
it. The prior probability statement is the first. The second is, in a 
scientific problem, indicated sufficiently by our willingness to under- 
take the work of finding the answer; it is a statement of a wish and is 
not a probability statement. The third is a probability statement of 
higher order; and all this is done in speech by a transposition. Yet 
people continue to question whether degrees of knowledge can be 
expressed in symbols. What the prior probability does, in fact, is to 
state clearly what question is being asked, more clearly than ordinary 
language is capable of doing. And I suggest that this is no mean 
achievement. Many will support me when I say that 90 per cent. of 
the thought in a scientific investigation goes in the preliminary framing 
of the question; once it is clearly stated, the method of answering it is 
usually obvious, laborious perhaps, but straightforward. Consider, for 
instance, the work of G. I. Taylor and H. Quinney on the plasticity 
of copper,† to decide whether the difference between the largest and 
smallest principal stresses at a point, or the Mises function, which is a 
symmetrical function of the three principal stresses, afforded the correct 
criterion for the start of flow. It was known that different specimens of 
the material differed more than the difference between the criteria 

† Phil. Trans. A, 230, 1932, 323-62. 

would be. Hence to answer the question it was necessary to eliminate 
this variation by working on the same specimen throughout. But then 
something that would differ according to the criterion had still to be 
found. They showed that if tension P and shear stress Q were applied 
simultaneously, the former directly, the latter by torsion, the Mises 
criterion would give flow at a constant value of $P^2 + 3Q^2$, the 
stress-difference at a constant value of $P^2 + 4Q^2$. Here at last was an 
answerable question clearly stated. The suggested experiment needed care 
and skill, but not much more; the brilliance was in asking the right 
question. It would be easy to give a long list of papers that cannot 
answer the question that they claim to answer, simply because 
insufficient attention has been given to whether the data are suited to 
decide between the possible alternatives. 
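
The distinction that Taylor and Quinney exploited can be checked numerically. The sketch below assumes the standard textbook forms of the two criteria (the stress-difference as the largest minus the smallest principal stress, the Mises function as half the sum of the squared principal-stress differences) and a plane state of stress with one tension P and one shear Q. The function names are illustrative, and nothing here reproduces their actual computation.

```python
import math


def principal_stresses(p, q):
    """Principal stresses for a tension p combined with a shear stress q
    (plane stress, so one principal stress is zero)."""
    half = p / 2.0
    radius = math.hypot(half, q)
    return (half + radius, 0.0, half - radius)


def stress_difference_sq(p, q):
    """Square of (largest minus smallest principal stress)."""
    s = principal_stresses(p, q)
    return (max(s) - min(s)) ** 2


def mises_sq(p, q):
    """Mises function: half the sum of squared principal-stress differences."""
    s1, s2, s3 = principal_stresses(p, q)
    return 0.5 * ((s1 - s2) ** 2 + (s2 - s3) ** 2 + (s3 - s1) ** 2)


# Hypothetical loading, in arbitrary units.
p, q = 3.0, 2.0
print(stress_difference_sq(p, q), p**2 + 4 * q**2)
print(mises_sq(p, q), p**2 + 3 * q**2)
```

With these definitions the squared stress-difference reduces to P^2 + 4Q^2 and the Mises function to P^2 + 3Q^2, so the two criteria predict different combinations of P and Q at the start of flow in one and the same specimen.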

Part of the objection to probability as a primitive notion is 
connected with the belief that everything is vague until it is defined in 
words. Such a belief omits to recognize that some things are perfectly 
intelligible before any definition is available. To try to define such 
things can result only in defining them in terms of something less 
immediately intelligible and failing to give account of established laws. 
For instance, observed colours are found to be associated with different 
measured wave-lengths. This led to the idea that colour should be 
defined in terms of the wave-length and the sensory impression rejected. 
This was vigorously advocated; but had it been acted upon nobody 
would have been able to say that a thing was red until he had actually 
set up a spectroscope and measured the wave-length of the radiation 
coming from it. Not even the persons with the facilities for doing it 
would act on the principle. What the recommendation does is to reject 
an important means of investigation, and the empirical relation 
between colour and wave-length. The behaviourist psychologists reject 
consciousness and thought except so far as they can define them in 
terms of certain minute movements in the throat that go on when the 
person says he is thinking. Consequently, in their system, there are two 
alternatives. (1) A man has no way of knowing whether or what he is 
thinking except by observing these movements. Many people manage 
very well without it. (2) He may admit his own consciousness but 
reject other people’s. That is solipsism, and no two solipsists can 
understand each other and agree. Eddington, finding the fundamental 
laws of physics symmetrical with regard to past and future, searches 
for something that does vary in one direction with time and finds 
entropy; and therefore defines the order of increasing time as that of 
increasing entropy. Consequently he could not know that he wrote the 
Relativity Theory of Protons and Electrons after he discovered the 
mass-luminosity relation except by measuring the entropy of the 
universe on the two occasions. It all seems very difficult. Bertrand 
Russell, who cannot be accused of shirking the logical consequences of 
his postulates, or of refusing to change the postulates when the 
consequences are intolerable, has arrived at the conclusion:† ‘Things are 
those series of aspects which obey the laws of physics. That such series 
exist is an empirical fact, which constitutes the verifiability of physics.’ 
Much of what passes for modern theoretical physics consists in the 
application of the first sentence while forgetting the second. To be a 
practical definition it must refer to the laws already known, not to the 
aggregate of all laws. In the former sense it is a possible rule for 
progress; in the latter it is a mere counsel of perfection. But in the 
former sense the fact that series have been found to fit the laws is 
equivalent to saying that laws have been found to fit the aspects. 
Russell, be it noted, does not define an aspect, but merely gives a rule 
about what aspects are to be grouped in a series to constitute a thing; 
and the second sentence recognizes that a possible law must be rejected 
if no series of aspects can be found that conform to it. 

Definitions add clarity when something new is defined in terms of 
something already understood; but to define anything already 
recognizable is merely to throw valuable information into the wastepaper 
basket. All that can be done is to point to instances where the pheno- 
menon in question arises, in order to enable the reader to recognize 
what is being talked about by comparison with his own mental processes 
and sensations. 

W. E. Johnson‡ puts the point even more strongly. He remarks that 
some things are ‘so generally and universally understood that it would 
be mere intellectual dishonesty to ask for a definition’. 

8.2. We can never, formally, rule out the possibility that some new 
explanation may be suggested of any set of experimental facts. But we 
have seen that in many cases this does not matter, by 1.6. Once a law 
has attained a high probability it can be used for inference irrespective 
of its explanation. If an explanation also accounts for several other 
laws, so much the better; there is more for any alternative to explain 
before it can be said to be as satisfactory as the existing one. The 
question of an alternative becomes effective only when (1) it accounts 

† Our Knowledge of the External World, 1914, p. 110. 
‡ Logic, 1, 106. 

for most or all of the evidence explained by the first, (2) it suggests a 
specific phenomenon that would differ according to which is right. 
The decision can then be made in accordance with our principles. This 
is the answer returned by the theory of probability to the logical 
difficulty of the Undistributed Middle, or the neglect of an unforeseen 
alternative. The use for inference is valid so long as it involves only 
the use of laws that have already been established inductively, because 
the laws are in a stronger position than any explanation could possibly 
be. When an explanation is used and applied to predict laws, these 
require test; but now the possible alternative explanations are severely 
limited by the fact that they must agree with the laws already known. 
Incidentally, this meets a possible difficulty with the rule that all 
suggestions have the same prior probability, no matter who makes 
them. The layman in a subject may be admitted as capable of making 
a good guess, but it is extremely hard for him to make a guess that is 
not contradicted by evidence already known. 

This also answers the problem of ‘scientific caution’. Everybody 
agrees on the need for caution, but different people, or even the same 
person on different occasions, may have entirely different opinions on 
what caution means. I suggest that the answer is that results should 
always be presented so that they will be of the maximum use in future 
work. That involves, for pure estimation, a statement of a location 
parameter and its standard error. But it can never be guaranteed 
that no modification in a law will ever need to be considered; and a 
possible systematic error of observation needs positive evidence for its 
existence just as any other modification does. To assert in advance any 
kind of departure from the suggested law is a reckless statement, 
irrespective of whether the departure considered is a systematic error 
of observation or a ‘physical’ effect that the physicist considers more 
interesting. In both cases the information should be presented so that 
a significance test can be applied when suitable evidence is available; 
and this implies giving the estimated value, the standard error, and the 
number of observations. There is no excuse whatever for omitting to 
give a properly determined standard error. It is a necessity in stating 
the accuracy of any interpretation of the data, if the law is right; if 
the law is wrong, it is necessary to the discovery that it is wrong. All 
statisticians will agree with me here, but my own applications are 
mostly in subjects where the need is still very inadequately appreciated. 
Again, the best way of finding out whether a law is wrong is to apply 
it as far as possible beyond the original data, and the same applies to 
any suggested explanation. But if we have not a determination of the 
standard errors of the parameters in the law we have no way of saying 
whether any discrepancy found is genuine or could be removed by a 
permissible readjustment of the parameters, with a corresponding 
improvement in their accuracy. The usual reason given for the omission 
is that there may be some other source of error and that the statement 
of a standard error expresses a claim of an accuracy that future events 
may not justify. This rests on a complete failure to understand the 
nature of induction. It is essential to the possibility of induction that 
we shall be prepared for occasional wrong decisions; to require finality 
is to deny the possibility of scientific inquiry at all. The argument, 
however, does not prevent its users from asserting systematic differ- 
ences when the estimates agree within the amounts indicated by the 
standard errors, supposing these genuine, or from denying them when 
they are flagrant. What we should do is (1) always to draw the most 
probable inference from the data available, (2) to recognize that with 
the best intentions on our part the most probable inference may turn 
out to be wrong when other data become available, (3) to present our 
information in such a form that, if we do make mistakes, they can be 
found out. This can be done by a consistent process, and should not 
be confused with guesswork about other possible effects before there is 
any evidence for their existence or any estimate of their amount. 
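
The recommendation to state the estimate, its standard error, and the number of observations can be sketched as follows. The two series of measurements are invented for illustration, and the comparison shown is only the plain ratio of a difference to its standard error under the assumption of independent errors; it is not the significance test developed in the present theory, merely the raw material that any such test would need.

```python
import math
import statistics


def summarize(values):
    """Return (estimate, standard error, number of observations)."""
    n = len(values)
    mean = statistics.fmean(values)
    # Standard error of the mean from the sample standard deviation.
    se = statistics.stdev(values) / math.sqrt(n)
    return mean, se, n


def discrepancy_in_standard_errors(est1, se1, est2, se2):
    """Difference of two independent estimates in units of its standard error."""
    return (est1 - est2) / math.hypot(se1, se2)


# Hypothetical determinations of the same quantity by two observers.
series_a = [9.8, 10.1, 10.0, 9.9, 10.2]
series_b = [10.6, 10.4, 10.5, 10.7, 10.3]

est_a, se_a, n_a = summarize(series_a)
est_b, se_b, n_b = summarize(series_b)
z = discrepancy_in_standard_errors(est_a, se_a, est_b, se_b)

print(f"A: {est_a:.3f} +/- {se_a:.3f} (n = {n_a})")
print(f"B: {est_b:.3f} +/- {se_b:.3f} (n = {n_b})")
print(f"difference / standard error = {z:.2f}")
```

Had either observer published only the mean, a later worker would have no means of telling whether the half-unit difference was genuine or within the permissible readjustment of the estimates.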

8.3. The situation with regard to alternative explanations mentioned 
above actually existed for a long time in relation to the quantum 
theory. The quantum explanation seemed to be demanded by the 
distribution of black-body radiation and by the photo-electric effect; 
it seemed to be denied by the phenomena of interference, notably by 
G. I. Taylor’s experiment,† which obtained interference patterns under 
illumination of intensity so low that it was highly improbable that 
there would ever be two quanta inside the apparatus at once. The 
quantum theory and the continuous emission theory both accounted 
for one set of facts, but each, in its existing form, was inconsistent with 
the facts explained by the other. The proper conclusion was that both 
explanations were wrong, and that either some new explanation must 
be sought or the sets of data recognized as unrelated. But meanwhile, 
physicists based their predictions on the laws; in types of phenomena 
that had been found predictable by quantum methods, they made their 
predictions by quantum methods; in phenomena of interference they 

† Proc. Camb. Phil. Soc. 15, 1909, 114-15. 

made predictions by assuming continuous wave trains. Thus what 
they really did was to proceed by induction from the laws established 
empirically. This was a valid process and did not require the assertion 
of any particular explanation of the laws, the latter being entirely 
subsidiary. 

The present position of the quantum theory illustrates another point 
in relation to the theory of probability. There are three main quantum 
theories; but all make the same predictions and for, it may be, the first 
time in the history of physics, the exponents are willing to accept the 
situation and even on occasion to use one another’s methods. The 
theories themselves are not the same, and indeed each contains reference 
to things that have no meaning on another. The treatment of them as 
equivalent refers only to the observable results predicted, and not to 
their actual content. It recognizes that as long as theories lead to the 
same predictions they are not different theories, but merely different 
ways of saying the same thing. The differences are relegated to 
metaphysics. But this is a complete abandonment of naive realism, in which 
the things with ‘physical reality’ would be those contained in the 
explanations, and no others. It does not matter, for instance, whether 
an electron is a point charge with an exact position that we do not 
quite know, or a volume distribution rather fuzzy at the edges, or 
whether the position of the electron is intrinsically meaningless in the 
sense that it cannot be expressed in terms of three Cartesian coordinates 
at all. This attitude is precisely what is reached here; the essential 
thing is the representation of the probability distribution of observable 
events, and therefore the forms of laws and the values of parameters 
in them. Questions that cannot be decided by means of observation 
are best left alone until some way of answering them suggests itself. 

8.4. The modern quantum theories, like the relativity theories, suffer 
from a confusion in the use of the term ‘the rejection of unobservables’. 
‘Unobservable’ is a legacy from naive realism. An observation, strictly, 
is only a sensation. Nobody means that we should reject everything 
but sensations. But as soon as we go beyond sensations we are making 
inferences. When we say that we have observed an object we mean 
that we have had a series of sensations that are coordinated by imagin- 
ing or postulating an object with assigned properties, and that to con- 
tinue to do so will probably lead us to a correct prediction of other 
groups of sensations. ‘To observe an object’ is merely an idiomatic 
shorthand way of writing this; what we really observe is a series of 
patches of colour of various shapes, and whether these are correctly 
located in our minds or where we suppose the object to be must be left 
to philosophers. But in naive realism it is taken for granted that we 
do observe the object and that the patches of colour are ‘subjective’ 
and not respectable; and this puts the cart before the horse because 
except through the latter there is no way of finding out anything about 
the object at all. The acceptance of an object with its properties 
depends on the verification of the inferences that it leads to; that is, it is 
required that our sensations without it, or if it had different properties, 
would be different from what they have actually been. Hence the 
verifiable content can be stated entirely in terms of parameters in laws 
connecting sensations. This is dealt with completely by the theory of 
probability, and for purposes of inference the laws are all we want. If 
we restrict ourselves to the inference of future sensations the concept 
has done its work and serves no other purpose. This would be a possible 
idealist attitude. If we are realists and think that our concepts have 
counterparts in an external world (subject to the critical realist’s 
willingness to change his mind if necessary), we may consider the law 
as a justification of the reality of the concept. But observability of 
a concept can mean nothing but the statement that it suggests new 
parameters in laws connecting sensations, and that the need for these 
parameters is supported by a significance test. Thus the theory of 
probability takes the rejection of unobservables in its stride. It gives an 
answer to the question whether any parameter is more probably present 
than not, given the actual data. To consider further data that we have 
not is sheer waste of time. We do not say that so-and-so must be 
unobservable; we say that, with the information at our disposal, it is 
unobserved, and that if we try to take it into account we shall probably 
lose accuracy. To say that it must be unobservable would be illegiti- 
mate; it would be either an a priori statement leading to inferences 
about observations or an induction claiming deductive certainty.† 
The principle really seems to have arisen from a confusion of three 
possible statements of the ‘economy of hypotheses’. (1) In developing 
a logic, as in Principia Mathematica, the number of postulates is reduced 
to a minimum, though some results that appear as theorems appear 
equally obvious intuitively. The reasons for this procedure have been 
discussed under rule 6 of Chapter I. (2) Parameters in a law that make 

† Cf. H. Dingle, Nature, 141, 1938, 21-8. This is an admirable statement of the logical 
position of the principle, except for the omission to consider any realism but naive 
realism. 

no contribution to the results of any observation can be eliminated 
mathematically, leaving the observations to be described only in terms 
of the relevant parameters. When this is done an economy of statement 
may be achieved (possibly at the cost of increased complexity of mathe- 
matical form), but there is no improvement in representing either 
present or future observations, since either form will say precisely the 
same thing about both. (3) The third is the simplicity postulate as used 
in the present theory, which leads to the restatement of Ockham’s 
principle in the form ‘Variation must be taken as random until there is 
positive evidence to the contrary’. This is the principle that we actually 
need. The second principle is always a pure tautology; but in the usual 
statement it becomes the ‘rejection of unobservables’ and is used to 
deny the relevance of any variable not yet considered. It then becomes 
an a priori statement that future observations must follow certain laws, 
whatever the observations may say. Such an inference into the future 
must be an inductive inference based on probability, because it is 
logically possible that the observations may disagree with prediction. 
The third principle deals with such inferences, but the attempt to use 
the second involves a logical fallacy. 

Now I maintain that whatever has been said on the matter, the 
rejection of unobservables in the form stated has never led to a single 
constructive advance, and that in spite of the reluctance of modern 
physicists to pay any serious attention to the problem of induction, 
what they have done is to use induction and then confuse it with 
deduction. Relativity, up to 1920 or so at any rate, did not involve 
any new parameters; the velocity of light, the constant of gravity, the 
mass of the sun, and so on, were all required by previous theories. It 
made changes in the laws but left them expressed in terms of the same 
parameters. The reason for abandoning the old theory was not that it 
involved unobservables such as absolute velocity or simultaneity; it was 
that this theory made positive predictions, such as the one sought for 
in the Michelson-Morley experiment, which turned out to be in dis- 
agreement with observation. The rejection of absolute velocity was 
not a priori; what was done in the special theory of relativity was to 
alter the laws of measurement and light so that they would agree with 
observation. The general theory, in its original form, was obtained by 
a natural analogy with Newtonian dynamics. The coefficients $g_{\mu\nu}$ in 
what seemed to be the natural extension of the special theory to take 
gravitational effects into account, were seen to play the part of the 
Newtonian potential U. Far from matter all second derivatives of 
the latter vanish; near to matter the contracted Cartesian tensor $\nabla^2 U$ 
vanishes, but the separate components do not; inside matter $\nabla^2 U$ does 
not vanish, but has a simple relation to the density. Einstein proceeded 
by analogy. He found a second-order tensor that should vanish far 
from matter, contracted it to get the differential equations satisfied 
near matter, and said that these equations will be modified inside 
matter. Given, what was already established, that the Euclid-Newton 
system needed modification, this was the natural procedure to try. But 
it is a suggestion, not an a priori necessity. On this point one may 
refer to Eddington, writing just before the 1919 eclipse expeditions:† 
‘The present eclipse expeditions may for the first time demonstrate the 
weight of light; or they may confirm Einstein’s weird theory of non- 
Euclidean space; or they may lead to a result of yet more far-reaching 
consequences — no deflexion.’ The first alternative refers to the 
Newtonian deflexion, which would be half Einstein’s. That was Eddington’s 
position before the observational result; Einstein’s theory stood to him 
as the theory of probability says that it should, as a serious possibility 
needing test, not as demonstrable by general principles without refer- 
ence to observation. In other words, Eddington at the proper time 
agreed with me; his later emphasis on the mathematical necessity of 
Einstein’s theory is a case of ‘forgetting the base degrees’. The correct- 
ness of Einstein’s law rests on the fact that it requires no new para- 
meters and gives agreement with observation where the alternatives 
fail. Insistence on the alleged philosophical grounds for it has led to 
their being challenged, and to a tragic neglect of the observational basis. 
The latter is, in fact, appreciably stronger than is provided by the 
mere verification, as I showed in chapters vii-ix of Scientific Inference. 
Starting entirely from observed data and proceeding by generalization 
of laws, introducing new parameters only when observation showed 
them to be necessary, I showed that it was possible by successive 
approximation to build up Euclidean mensuration, Newtonian dyna- 
mics, and the special and general theories of relativity; and that the 
form of Einstein’s $ds^2$ is completely determined near the sun by 
observation alone. No further hypothesis is needed, and some of those made 
by Einstein are replaced by others more closely related to laws already 
adopted or by experimental facts. The linearity of the transformation 
of coordinates in the special theory, for instance, need not be assumed. 
It can be proved from the constant measured velocity of light and the 
natural extension of Newton’s first law, that an unaccelerated particle 

† The Observatory, March 1919, p. 122. 

in one inertial frame must be unaccelerated in another. The object of 
the work was to see whether the observed agreement could be regarded 
as accidental, that is, whether any other possible laws (Newton’s in 
particular) could have given the same results in the range of magnitude 
available; and it was found that no other form would explain on 
Newton’s theory a fact not explained on Einstein’s without leading to 
contradictions elsewhere. For instance, the excess motion of the peri- 
helion of Mercury had been known for ages to be explicable by the 
attraction of an oblate distribution of matter around the sun, such as 
was seen in the zodiacal light; and with a suitable inclination of the 
axis such matter could also explain the excess motion of the node of 
Venus, which is not explicable on Einstein’s theory and is too large to 
be regarded as random error. To explain it by gravitation would require 
enough matter to upset the agreement for the perihelion of Mercury. 
Similarly, it was suggested, I believe by Professor H. F. Newall, that 
the eclipse deflexion could be explained by the refraction of matter 
near the sun. But such Newtonian explanations led to estimates of the 
amount of matter needed, and according as it was solid or gaseous 
the amount of light it would scatter could be estimated. It was found 
that the visible scattered light did not correspond to more than an 
insignificant fraction of what would be implied by the Newtonian 
explanation.† Using some more recent data I find a larger discrepancy. 
Hence there is no Newtonian explanation in sight for either the peri- 
helion of Mercury, the node of Venus, or the eclipse displacement; while 
Einstein’s law explains the first and third. The node of Venus is not 
evidence for Newton’s law, because this does not explain it either. This 
discrepancy is apparently significant, but what it signifies is not clear; 
it may represent some systematic error of observation or internal cor- 
relation of the errors, though these have not been adequately tested. 
What is quite clear, however, is that it is irrelevant to the decision 
between the two laws of gravitation. So far as any law can be proved 
by observation (and no law can be proved at all in any other way), 
Einstein’s law is proved within the solar system. 
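
The remark that the Newtonian deflexion would be half Einstein’s can be made concrete. The sketch below uses the standard expressions for a ray grazing the sun’s limb, 4GM/(c^2 R) on Einstein’s law and 2GM/(c^2 R) on the Newtonian corpuscular view, with modern values of the constants; these values and the function name are assumptions introduced for illustration and are not the figures available in 1919.

```python
import math

# Modern reference values, assumed for illustration only.
GM_SUN = 1.32712440018e20  # m^3 s^-2, heliocentric gravitational constant
R_SUN = 6.957e8            # m, nominal solar radius
C = 299_792_458.0          # m s^-1, speed of light
ARCSEC_PER_RADIAN = 180.0 * 3600.0 / math.pi


def limb_deflection_arcsec(factor):
    """Deflexion of a ray grazing the sun's limb, in seconds of arc.

    factor = 4 for Einstein's law, 2 for the Newtonian value.
    """
    return factor * GM_SUN / (C**2 * R_SUN) * ARCSEC_PER_RADIAN


einstein = limb_deflection_arcsec(4.0)
newtonian = limb_deflection_arcsec(2.0)

print(f"Einstein:  {einstein:.2f} arcsec")
print(f"Newtonian: {newtonian:.2f} arcsec")
```

The three outcomes Eddington listed therefore correspond to three distinguishable numbers at the limb (the Newtonian value, twice that value, and zero), which is what made the eclipse observation a proper test.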

The rejection of unobservables in the quantum theory seems to be 
a mere spring-cleaning and to be correctly placed under the second of 
the above principles. The older theories involved many unobservable 
quantities, and left many observable ones uncoordinated. It had be- 
come impossible to see the wood for the trees on account of the com- 
plications of the concepts, and the postulates led to results inconsistent 

† M.N.R.A.S. 80, 1919, 138-54. 

with observation. The modern quantum theories have begun by direct 
and successful attempts to coordinate what we know, without attending 
to the details of any deeper interpretation, and this was right as a 
matter of mathematical convenience. But it is no more a rule for 
positive discovery than the fact that a gardener weeds his plot before 
sowing his seed. The important forward step did not come from the 
rejection of unobservables but from the subsequent recognition of 
formal relations. These relations are not inferred from a principle that 
so-and-so must be unobservable — and indeed they are full of new un- 
observables of their own, which have to be eliminated before anything 
verifiable is reached. They are guessed by analogy with Newtonian 
dynamics and asserted because their consequences agree with 
observation, just like Einstein’s law of gravitation. 

The most elaborate use of the form of the rejection of unobservables 
criticized on p. 385 is to be found in the works of Eddington, 
culminating in his statement that all the fundamental laws and constants of 
physics can be predicted from purely epistemological considerations. 
Some comments on his conclusion are given in 5.64; a criticism of his 
general point of view in the Philosophical Magazine paper cited there. 

A warning is needed that the frequent use of the word ‘probability’ 
in works on quantum theory is no guarantee that the numbers referred 
to are probabilities in any sense or satisfy the laws of probability, and 
that there is reason to suppose that the probability interpretation of 
wave mechanics leads to the conclusion that quantum theory is deterministic 
in exactly the same sense as classical mechanics.† 

8.5. Criticism of fallacious logic is usually treated as captious, on the 
grounds that the methods criticized have delivered the goods. It is not 
considered a matter of importance to physics whether the arguments 
are right so long as they somehow give the right answer at the end. 
But the methods have not delivered the goods. The chief advances in 
modern physics were not achieved by the rejection of unobservables or 
by any other alleged general mathematical principle. They were 
achieved by the method of Euclid and Newton: to state a set of 
hypotheses, work out their consequences, and assert them if they 
accounted for most of the outstanding variation. The method was 
inductive, and the claim that the results were obtained in any other 
way is contrary to history. The insistence on the mathematical argument 
as a proof, in turn, invites challenge on grounds of logic; either it is 
† Cf. Phil. Mag. (7), 33, 1942, 816-31. 




important or it is not. If it is, it must be prepared to meet logical 
criticism by a logical answer; if it is not, it should be dropped and 
cease to make bad logic an essential part of what is supposed to be 
mathematics. Above all, it should cease to obstruct the development 
of an adequate theory of induction. 

Reasoning and observation are two different faculties, and it is im- 
portant to keep them separate, as far as possible, and to separate them 
as well as we can if the information presented to us is in such a form 
that they have already been mixed. If this is not done we may find 
ourselves in the position of saying that the argument is right and there- 
fore we do not need observations to test whether we have overlooked 
anything; or that the argument leads to results agreeing with observa- 
tion and therefore must be right however many mistakes are found 
within it. Many modern examples of both could be found. The fol- 
lowing one, though not exactly recent, is an interesting illustration 
of how attention to the details of an argument has actually led to 
constructive results. Laplace in his calculation of perturbations had 
shown that the eccentricity of the earth’s orbit should be systematically 
diminishing. This affects the disturbance of the moon by the sun, and 
leads to the result that the moon’s distance should be decreasing, and 
its rate of revolution about the earth increasing. This would alter the 
calculated times of ancient eclipses, and recorded observations of them 
showed that such an effect was required. Laplace gave only the first 
term of the series representing it, but this was near enough to the 
observed value for Plana, Damoiseau, and Hansen to develop the matter 
and include further terms. The agreement at this point seemed entirely 
satisfactory. J. C. Adams, however, worked out the theory afresh† and 
found that several neglected terms mounted up. The first two coefficients 
of the series in powers of m, where m is the ratio of the mean motions, 
are whereas Plana had got for the second. On 

account of this enormous numerical coefficient the calculated value of 
the secular acceleration was practically halved, and the agreement 
with observation was destroyed. Adams’s result was confirmed by 
Delaunay and several other dynamical astronomers, who obtained 
further terms. But Pontécoulant said that if the result of Adams were 
admitted it would ‘call in question what was regarded as settled, and 
would throw doubt on the merit of one of the most beautiful discoveries 
of the illustrious author of the Mécanique céleste’. Le Verrier wrote: 
‘Pour un astronome, la première condition est que ses théories satisfassent 

† Phil. Trans. 143, 1853, 397-406; see also several of his collected papers. 

les observations. Or la théorie de M. Hansen les représente 
toutes, et l’on prouve à M. Delaunay qu’avec ses formules on ne saurait 
y parvenir. Nous conservons donc des doutes et plus que des doutes 
sur les formules de M. Delaunay. Très certainement la vérité est du 
côté de M. Hansen.’ Thus the mathematics of Adams and Delaunay 
was to be judged, not by whether the results followed from the equa- 
tions of dynamics, but by whether they agreed with observation; if 
the results disagreed with observation there must be a mistake in the 
mathematics. J. W. L. Glaisher remarks in his biographical notice:† 
‘It is curious that it should have been possible for so much difference of 
opinion to exist upon a matter relating only to pure mathematics, and 
with which all the combatants were fully qualified to deal, as is clearly 
shown by their previous publications.’ What happened, in fact, was that 
Adams’s result was so thoroughly confirmed by different methods and 
different investigators that it had to be accepted and the discrepancy 
admitted. But the result was not purely destructive. What it did was 
to direct attention to the matter afresh and to lead to the theory of 
tidal friction in a long series of papers by Sir G. H. Darwin;‡ at the 
present time this appears to give quite satisfactory quantitative agreement 
with observation,§ and a large number of constructive results 
about the remote past and future of the solar system, which could never 
have been considered at all if Plana’s result had stood unquestioned. 

The use of the word ‘theory’ in several different senses is perhaps 
responsible for a good deal of confusion. What I prefer to call an 
‘explanation’ consists of several parts: first, a statement of hypotheses; 
secondly, the systematic development of their consequences; thirdly, the 
comparison of those consequences with observation. It still sometimes 
happens, as in some passages just quoted, that the fact that the alleged 
consequences agree with some observations is taken as a proof both that the 
hypotheses are right and that the intermediate steps have been correctly 
worked out. What is liable to be true is that the intermediate develop- 
ment involves numerous begged questions, the answers having been 
chosen so as to agree with observation and not because they are conse- 
quences of what has gone before; and that the correct working out of 
the consequences leads to results disagreeing with the very observations 
that the theory is said to explain. In such cases the hypotheses are 
disproved. Further, it is open to anybody to work out other consequences 

† Adams, Collected Works, p. xxxviii. ‡ Scientific Papers, vol. 2. 

§ G. I. Taylor, Phil. Trans. A, 220, 1919, 1-33; Jeffreys, ibid. 221, 1920, 239-64; 
The Earth, 1929, ch. xiv. 

of the hypotheses and to see whether these agree with observa- 
tion, and if they do not, to suggest a different set of hypotheses. That 
is how science advances. There are some current ‘theories’ that, when 
divested of begged questions, reduce to the non-controversial statement, 
‘Here are some facts and there may be some relation between them’. 

8.6. To recapitulate the main postulates of the present system, we 
have first the main principle that the ordinary common-sense notion 
of probability is capable of consistent treatment. Other theories can 
deny the consistency, but cannot help using the notion. We have also 
Axiom 4, which implies that there is no inconsistency in using the 
addition rule. The rule as it stands is a convention, since other rules 
consistent with the axioms would be possible and would lead to putting 
probabilities in the same order, and all could be compared with a 
standard obtained by considering balls in a bag. Thus the numerical 
assessment merely specifies the rules of a language capable of going 
into more detail than ordinary language. A generalization of the product 
rule may be needed, justified by the principle adopted in Principia 
Mathematica that in constructing a logic the postulates should be taken 
in their most general form. These postulates are required in all theories. 
The principle of inverse probability is a theorem. The prior probabilities 
needed to express initial ignorance of the value of a quantity to be 
estimated, where there is nothing to call special attention to a particular 
value, are given by an invariance theory that leads to equivalent results 
for transformations of the parameters, combined with some rules of 
irrelevance to the effect that the actual values of certain parameters, 
especially scale parameters, tell us nothing about those of certain others. 
Where a question of significance arises, that is, where previous considera- 
tions call attention to some particular value, half the prior probability 
is concentrated at that value. This is the simplicity postulate. It needs 
some elaboration when several parameters arise for consideration simul- 
taneously. 

The main results are: (1) a proof independent of limiting processes 
that the whole information contained in the observations with respect 
to the hypotheses under test is contained in the likelihood, and that 
where sufficient statistics exist other functions of the observations are 
irrelevant; (2) a development of pure estimation processes without 
further hypothesis ; (3) a general theory of significance tests, which allows 
any hypothesis to be tested provided only that it is sufficiently clearly 
stated to be of any use if it is true, declares no empirical hypothesis 





to be certain or false a priori, does not require the introduction of the 
P integral to avoid results in contradiction with common sense, and 
leads to a solution of the estimation problem as a by-product of the 
significance test instead of as a separate problem based on contradictory 
hypotheses; (4) arising out of this, an account of how in certain condi- 
tions a law can reach a high probability and inferences from it be treated 
as deductive in an approximate treatment. It thus makes it possible 
to test laws by observation, without making either the unnecessary 
assumption that laws can be found to fit the observations exactly, or 
the false one that laws known to us at present do; thus it gives a formal 
account of the actual process of learning. Further, it solves the problem 
of the rejection of unobservables, replacing a useless mathematical 
platitude by a practical criterion; removes the paradoxical appearance 
of the uncertainty principle; meets the logical difficulty of the undis- 
tributed middle; and gives intelligible meanings to ‘scientific caution’ 
and the notion of ‘objectivity’. 

Comment was made in Chapter I on the fact that a formal and 
consistent theory of inductive processes cannot represent the operation 
of every human mind in detail; it will represent an ideal mind, but it 
will also help the actual mind to approximate to that ideal. We have 
had occasion sometimes to call attention to special imperfections, 
notably: (1) wish-fulfilment, expressed sometimes in an exaggerated 
lenience towards one’s own hypotheses, sometimes in a belief that 
things can be proved in terms of ordinary mathematics and deductive 
logic when in their very nature they cannot be, and an appearance of 
such a proof is simply a proof that there must be a mistake in it; 
(2) imperfect memory, which can be treated merely as a suggestion of 
alternatives but not as a contribution of observational information 
when the matter is brought up for formal consideration; (3) failure to 
think of the right empirical hypothesis at the time when data are first 
available to test it; (4) limitations of time or industriousness that make 
us content with approximations. The existence of these is no argument 
against the theory; but the theory will provide a standard of com- 
parison for them in psychological studies; psychology is admitted as 
a valid science to the same standards as any other. 

The human mind has also a tendency to exaggerate the differences 
between familiar things and overlook the resemblances. Let us recall 
the reply of Dr. Jervis to a lady who had asked whether Dr. Thorndyke 
was ‘at all human’.† ‘“He is entirely human,” I replied, “the accepted 

† R. Austin Freeman, John Thorndyke’s Cases, p. 00. 

test of humanity being, as I understand, the habitual adoption of the 
erect posture in locomotion, and the relative position of the end of 
the thumb...” 

‘“I don’t mean that,” interrupted Mrs. Haldean. “I mean human 
in things that matter.” 

‘“I think those things matter,” I rejoined. “Consider, Mrs. Haldean, 
what would happen if my learned colleague were to be seen in 
wig and gown, walking towards the Law Courts in any posture other 
than the erect. It would be a public scandal.”’ 

We have, of course, the words ‘person’ and ‘human’, which can apply 
to any member of the species. But though we have six or seven words 
to describe different sexes and ages of the species Canis familiaris, Bos 
taurus, Equus caballus, we have no standard word that can apply to 
any individual of either.† The real reason for the difficulty in the 
understanding of the theory of probability is, I think, that the fundamental 
ideas and general principles are so familiar that ordinary lan- 
guage has overlooked them, and when they are stated it is immediately 
taken for granted that they must mean something too complicated for 
ordinary language, and a search is made for something to satisfy this 
condition. The truth is that they are too simple for ordinary language, 
and the customary approach renders any understanding impossible. 

8.7. We now return to the question of realism versus idealism. The 
question is whether the theory leads to any decision between them. 
Nothing in the theory depends on the acceptance of one or the other, 
and to arrive at a decision in terms of it we must point to some observ- 
able fact that would be more probable on one than on the other. Both 
are admissible hypotheses and we must take their prior probabilities 
as ½. We see that solipsism, the extreme form of idealism, can be 
rejected by the theory. If other people had not minds something like 
my own it would be very improbable that their behaviour would 
resemble mine as much as it does. The belief in a material world is 
on a different footing, since while I seem to be immediately aware of 
my own personality, any object, even my own body, is known to me 
only through sensations. If I was an idealist I should say that I had 
invented it to give a convenient way of describing my sensations (past, 
present, and future, so far as they can be inferred, since we are not 
considering the rejection of induction). A realist would say that he 

† Curiously, the infantile ‘bow-wow’, ‘moo-moo’, ‘gee-gee’ can apply to any member 
of the respective species. The loss of general words has taken place in acquiring adult 
language. 





meant something more than that, but it is very difficult to say just 
what. Personally I believe that in studying seismology I am finding 
out something about the interior of the earth and not merely making 
predictions about future observations. But in either case the rival 
hypotheses could be tested only through the sensations predicted from 
them; and the properties that the idealist would assign by convention 
to his imaginary objects would be such as to lead to exactly the same 
predictions as those that the realist would postulate of the objects that 
he supposes real. Thus the theory of probability makes no decision 
whatever between critical realism and critical idealism, if the latter is 
taken as admitting other personalities; both have probability ½ and 
there appears to be no type of evidence that could alter this. An attempt 
to support idealism has been made by saying that realism involves an 
extra hypothesis and should therefore be rejected if evidence for it is 
not available. This appeal to the economy of hypotheses is not valid, 
however. It only justifies the omission to assert realism; that is, it still 
leaves us in the position ‘either idealism or realism is true’ but agreeing 
to say no more about it. The denial of the extra hypothesis is just as 
much a hypothesis as its assertion. The conclusion we reach, therefore, 
is that there are forms both of realism and of idealism that would be 
scientifically tenable, that scientific method cannot decide between 
them, and that it doesn’t matter anyhow. But neither of them is the 
form of realism or idealism usually advocated. Realism has the ad- 
vantage that language has been created by realists, and mostly very 
naive ones at that; we have enormous possibilities of describing the 
inferred properties of objects, but very meagre ones of describing the 
directly known ones of sensations; ‘probability’ is a word of five syl- 
lables, whereas the use of the notion dates from a time when one would 
be beyond our powers. So the idealist must either do his best with 
realist language or make a new one, and not much has been done in the 
latter direction. 

Questions like these, that cannot be answered by scientific means, 
may be called metaphysical. (I do not regard this as a mere term of 
abuse.) Another is the distinction between religion and materialism. 
A materialist can hold that all biological phenomena, including evolu- 
tion, are due to physical and chemical causes; he cannot state just why 
a Nautilus evolved into an ammonite, nor why an ammonite did not 
evolve back into a Nautilus, but he cannot be refuted on this ground 
because he can always appeal to the fact that the consequences of the 
laws have not yet been fully worked out and in any case there are 




presumably physical laws that are not yet known. Bishop Barnes can 
accept evolution and reject the account of creation in Genesis, and hold 
that evolution is the actual way the Creator creates species and that 
He laid down the physical laws in the first place. To him the discovery 
of scientific laws is the discovery of something about how the Creator 
works. Equally he cannot be refuted; it would be impossible to pro- 
duce any piece of observational evidence that could not be dealt with 
in this way. His view and the materialist’s are scientifically equally 
tenable; the choice between them is apparently a matter of what one 
wishes to believe and not of evidence. In spite of G. K. Chesterton’s 
opinion to the contrary, many people do find an emotional satisfaction 
in materialism. The opposition often alleged between religion and 
science arises only when religion ceases to be religion and becomes bad 
science. Actually they are mutually irrelevant. This is fortunate; it 
enables, for instance, both the Jesuit Seismological Association and 
Soviet Russia to produce good seismological observations. Similarly 
for the distinction between free will and determinism. The determinist 
can always say ‘it is predestined what I shall do; so there is only one 
course open to me; here goes!’ The Arabian Nights may be studied for 
examples. 

8.8. The present theory does not justify induction. I do not consider 
justification necessary or possible; what the theory does is to provide 
rules for consistency. A prediction is never in the form ‘so-and-so will 
happen’. At the best it is of the form ‘it is reasonable to be highly 
confident that it will happen’. This may be disappointing, but in the 
last resort that is all that we can say. The former statement is a falla- 
cious claim to deductive certainty; the latter is attainable by a consistent 
process. In this sense we can justify particular applications, and it is 
enough. 



APPENDIX 
TABLES OF K 


We have defined 

$$K = \frac{P(q \mid \theta H)}{P(q' \mid \theta H)} \bigg/ \frac{P(q \mid H)}{P(q' \mid H)},$$

where q is the null hypothesis, q' the alternative, H the previous information, 
and θ the observational evidence. We take the standard case 
where q and q' are equally probable given H. In most of our problems 
we have asymptotic approximations to K when the number of observa- 
tions is large. We do not need K with much accuracy. Its importance 
is that if K > 1 the null hypothesis is supported by the evidence; if 
K is much less than 1 the null hypothesis may be rejected. But K is 
not a physical magnitude. Its function is to grade the decisiveness of 
the evidence. It makes little difference to the null hypothesis whether 
the odds are 10 to 1 or 100 to 1 against it, and in practice no difference 
at all whether they are 10^ or 10'® to 1 against it. In any case what- 
ever alternative is most strongly supported will be set up as the hypothesis 
for use until further notice. The tables give values of χ², t², or z 
for K = 1, $10^{-1/2}$, $10^{-1}$, $10^{-3/2}$, $10^{-2}$. The last will be regarded as a limit 
for unconditional rejection of the null hypothesis. K = $10^{-1/2}$ represents 
only about 3 to 1 odds, and would be hardly worth mentioning 
in support of a new discovery. It is at K = $10^{-1}$ and less that we can 
have strong confidence that a result will survive future investigation. 
We may group the values into grades, as follows. 


Grade 0. K > 1. Null hypothesis supported. 
Grade 1. 1 > K > $10^{-1/2}$. Evidence against q, but not worth more than a bare mention. 
Grade 2. $10^{-1/2}$ > K > $10^{-1}$. Evidence against q substantial. 
Grade 3. $10^{-1}$ > K > $10^{-3/2}$. Evidence against q strong. 
Grade 4. $10^{-3/2}$ > K > $10^{-2}$. Evidence against q very strong. 
Grade 5. $10^{-2}$ > K. Evidence against q decisive. 
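The grading can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the treatment of a value of K falling exactly on a boundary is an assumption (the grades are stated with strict inequalities), and the posterior probability assumes the standard case in which q and q' are equally probable given H.

```python
# Upper limits of K for Grades 1 to 5: 1, 10^(-1/2), 10^(-1), 10^(-3/2), 10^(-2).
GRADE_LIMITS = [1.0, 10 ** -0.5, 10 ** -1.0, 10 ** -1.5, 10 ** -2.0]


def grade(K):
    """Return the grade (0-5) of the evidence against the null hypothesis q.

    A value exactly on a limit is put in the lower-numbered grade;
    that choice is an assumption made for this sketch.
    """
    if K <= 0:
        raise ValueError("K must be positive")
    # Each limit that K falls below raises the grade by one.
    return sum(1 for limit in GRADE_LIMITS if K < limit)


def posterior_of_null(K):
    """P(q | evidence) when q and q' start equally probable, so the odds on q are K."""
    return K / (1.0 + K)


if __name__ == "__main__":
    for K in (2.0, 0.5, 0.2, 0.05, 0.02, 0.001):
        print(K, grade(K), round(posterior_of_null(K), 3))
```

With K = $10^{-1/2}$ the posterior probability of q is about 0.24, which is the "about 3 to 1 odds" mentioned above.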


Any significance test must depend on at least two variables, the 
number of observations and the estimate of the new parameter (more 
usually the ratio of the latter to its estimated standard error). Conse- 
quently any table of K must be a table of at least double entry. In the 
tables I have taken those tests where K depends on not more than two 
variables. In most of each table the computations were based on the 
asymptotic formula, values for small numbers of observations being 





separately computed from the exact formula. Accuracy of a few per 
cent, was considered sufficient, since it will seldom matter appreciably 
to further procedure if K is wrong by as much as a factor of 3. 

It is clear from the tables how accurately it is worth while to do the 
reduction of a given set of observations. Consecutive values of χ² or t² 
for given ν usually differ by at least 10 per cent., often by 20 per cent. 
or more. If we get χ² or t² right to 5 or 10 per cent. we shall in practice 
be near enough, and this implies that the work should be right to about 
5 per cent. of the standard error. Hence as a general rule we should 
work to an accuracy of two figures in the standard error. More will 
only increase labour to no useful purpose; fewer will be liable to put 
estimates two grades wrong. For instance, suppose that an estimate is 
quoted as 4±2 from 200 observations, to be tested by Table III. This 
might mean any of the following: 

    Estimate     t²      Grade 
    4.5±1.5      9.0     2 
    3.5±2.5      1.96    0 
    4.0±2.0      4.0     0 

Similarly, 5±2 from 200 observations might mean any of: 

    Estimate     t²      Grade 
    4.5±2.5      3.24    0 
    5.0±2.0      6.25    1 
    4.5±1.5      9.0     2 
    5.0±1.5      11.1    3 
    5.5±1.5      13.4    4 

The practice of giving only one figure in the uncertainty must therefore 
be definitely condemned, but there is no apparent advantage in giving 
more than two. Similarly, minor correcting factors in K that do not 
reach 2 can be dropped, since decisions that depend on them will be 
highly doubtful in any case. 
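The effect of rounding is easy to check numerically. The sketch below is illustrative only: `t_squared` is a hypothetical helper, and the candidate values are unrounded figures that would all be printed as 4±2 or 5±2 when only one figure is kept.

```python
def t_squared(estimate, std_error):
    """Square of the ratio of an estimate to its standard error."""
    return (estimate / std_error) ** 2


# Unrounded values consistent with a printed "4 +/- 2" (an illustrative choice).
quoted_4_2 = [(4.5, 1.5), (3.5, 2.5), (4.0, 2.0)]
# Unrounded values consistent with a printed "5 +/- 2".
quoted_5_2 = [(4.5, 2.5), (5.0, 2.0), (4.5, 1.5), (5.0, 1.5), (5.5, 1.5)]

if __name__ == "__main__":
    for label, cases in (("4 +/- 2", quoted_4_2), ("5 +/- 2", quoted_5_2)):
        for est, se in cases:
            print(f"{label}: {est} +/- {se} gives t^2 = {t_squared(est, se):.2f}")
```

A single printed figure thus covers values of t² from about 2 to 9 in the first case and from about 3 to 13 in the second, which is why the grade cannot be recovered from it.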

It will be noticed in Table I that for small numbers of observations 
K = 1 is at not much over the standard error. This is rather surprising, 
but becomes less so when we consider the values of K in testing an 
even chance from samples of 5 and 6. 

    x   y   χ²    K          x   y   χ²    K 
    5   0   5.0   3/16       6   0   6.0   7/64 
    4   1   1.8   15/16      5   1   2.7   21/32 
    3   2   0.2   15/8       4   2   0.7   105/64 
                             3   3   0.0   35/16 

The exact values of K are given for comparison. For a sample of 5 the 
critical value is for χ² a shade less than 1.8; but this means a 4:1 
sample. For a sample of 6 it lies about midway between a 4:2 and a 5:1 
sample, corresponding to χ² about 1.7. We notice, however, that 
K = 0.1 is not attained by the most extreme samples possible. The 
interpretation of these small critical values is not that significance can 
be strongly asserted at them; indeed there is only a probability ½ of 
a systematic departure at the critical value anyhow. What they mean 
is that the outside factor is small, and with the best possible agreement 
with the null hypothesis there cannot be more than about 2 to 1 support 
for it. Consequently a smaller value of χ² is needed to reduce K to 1. 
The proper conclusion is that where the data are frequencies small 
samples can tell us little new in any case. 
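The small-sample values can be recomputed directly. The sketch below rests on an assumption not stated in this appendix: that under the alternative the chance has a uniform prior, which gives K = (n + 1)! / (x! y! 2^n) for x successes and y failures with n = x + y. The function names are hypothetical.

```python
from fractions import Fraction
from math import factorial


def exact_K(x, y):
    """Exact K for testing an even chance from x successes and y failures.

    Assumes a uniform prior for the chance under the alternative,
    so that K = (n + 1)! / (x! * y! * 2**n) with n = x + y.
    """
    n = x + y
    return Fraction(factorial(n + 1), factorial(x) * factorial(y) * 2 ** n)


def chi_squared(x, y):
    """chi^2 for an even chance: (x - y)^2 / n."""
    return Fraction((x - y) ** 2, x + y)


if __name__ == "__main__":
    for n in (5, 6):
        # Run from the most extreme sample down to the most even one.
        for x in range(n, (n - 1) // 2, -1):
            y = n - x
            print(n, x, y, round(float(chi_squared(x, y)), 1), exact_K(x, y))
```

On this assumption the most extreme sample of n gives K = (n + 1)/2^n, which first comes near 0.01 at n = 10 (11/1024), while the most even samples of 5 and 6 give 15/8 and 35/16, that is, about 2 to 1 in favour of the null hypothesis.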

In Tables I and II the values of χ² for given K increase steadily with 
n. I have indicated by italic figures in the upper part of Table I the 
values that have been calculated, but could not in practice arise in a 
sampling problem. It is only for a homogeneous sample of 10 that K 
can first approach 0.01. 

In Tables III and IV the values of t² for given K begin by decreasing 
as ν increases, reach a minimum, and then increase slowly, behaving 
for large ν as χ² does in Tables I and II. The difference is of course due 
to the allowance for the uncertainty of the standard error, as in the 
corresponding estimation problems. It is much more important for 
small K than for K = 1. 
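The fall and subsequent rise of t² can be reproduced from an asymptotic formula. The sketch below assumes the form K = (πν/2)^{1/2} (1 + t²/ν)^{-(ν-1)/2}, which agrees with the tabulated entries of Table III where checked but is taken here as an assumption; the function name is hypothetical, and the small-ν entries of the printed table come from the exact formula rather than from this one.

```python
import math


def t_squared_for_K(nu, K):
    """Solve K = sqrt(pi*nu/2) * (1 + t^2/nu)**(-(nu - 1)/2) for t^2 (requires nu > 1)."""
    ratio = math.sqrt(math.pi * nu / 2.0) / K
    return nu * (ratio ** (2.0 / (nu - 1.0)) - 1.0)


if __name__ == "__main__":
    # For small K the critical t^2 first falls with nu, then rises slowly.
    for nu in (6, 10, 16, 50, 200, 1000):
        print(nu, round(t_squared_for_K(nu, 0.1), 1), round(t_squared_for_K(nu, 1.0), 1))
```

For K = 0.1 the value drops from about 17.6 at ν = 6 to about 11.0 at ν = 16, continues to fall a little further, and then increases again for large ν; for K = 1 the initial fall is hardly visible, in line with the remark that the effect matters most for small K.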

Table V is intended to test the agreement of a standard deviation 
with a suggested value. K is not an even function of z and therefore 
it is necessary to tabulate separately for positive and negative z. It is 
actually very nearly an even function of 2/(1 — within the range of 
the table. The asymptotic formula was in satisfactory agreement with 
the exact formula at ν = 4. 

It is interesting to compare the results with those based on the 
customary use of the P integral. The usual treatment of the problems 
of Tables I and II would be to draw the line at values of χ² such that they 
have 5 per cent. or 1 per cent. chances of being exceeded on the null 
hypothesis. These limits are, for one new parameter, 3.8 and 6.6; for 
two, 6.0 and 9.2. In Table I, K = 1 lies below the 5 per cent. point 
up to n = 70, and passes the 1 per cent. point only about n = 1000. 
K = $10^{-1/2}$ lies below the 5 per cent. point only for n = 5 and 6, and 
reaches the 1 per cent. point about n = 130. 

Similarly, in Table II K = 1 lies below the 5 per cent. point up to 
n = 30, and passes the 1 per cent. point at n = 500. K = $10^{-1/2}$ never lies 
below the 5 per cent. point, and reaches the 1 per cent. point about n = 40. 
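The crossing points quoted for Table I can be estimated from the asymptotic formula alone. The sketch assumes K = (2n/π)^{1/2} exp(-χ²/2) for one new parameter and uses 3.84 and 6.63 as the usual 5 and 1 per cent. points of χ² with one degree of freedom; the function names are hypothetical. Since the small-n entries of the printed table were computed from the exact formula, only rough agreement should be expected.

```python
import math

CHI2_5_PERCENT = 3.84  # 5 per cent. point of chi^2, one degree of freedom
CHI2_1_PERCENT = 6.63  # 1 per cent. point of chi^2, one degree of freedom


def critical_chi2(n, K):
    """chi^2 at which K = sqrt(2n/pi) * exp(-chi^2/2) takes the given value."""
    return math.log(2.0 * n / math.pi) - 2.0 * math.log(K)


def n_where_K_reaches(chi2_point, K):
    """Number of observations at which the critical chi^2 for K equals chi2_point."""
    return (math.pi / 2.0) * K ** 2 * math.exp(chi2_point)


if __name__ == "__main__":
    for K in (1.0, 10 ** -0.5):
        print(K,
              round(n_where_K_reaches(CHI2_5_PERCENT, K)),
              round(n_where_K_reaches(CHI2_1_PERCENT, K)))
```

On these assumptions K = 1 meets the 5 per cent. point a little above n = 70 and the 1 per cent. point somewhat above n = 1000, while K = $10^{-1/2}$ meets the 1 per cent. point near n = 120, in fair agreement with the figures read from the table.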

The 5 per cent. and 1 per cent. points for t² can be taken from the 
tables given by Fisher, remembering that his n is my ν. The former 
drops from t² = 7.8 at ν = 4 to 3.8 for ν large; it lies between K = 1 
and K = $10^{-1/2}$ up to about ν = 50, and for larger ν below K = 1. The 
1 per cent. point lies between K = $10^{-1/2}$ and K = $10^{-1}$ up to about 
ν = 200, and below K = 1 for ν = 1000 and more. 

For z (Table V) the 5 per cent. point and K = 1 are close together 
both for positive and negative z. (My negative z corresponds to Fisher’s 
— 2 with infinite.) X = lO-’^^ agrees fairly well with the 1 per cent, 
point, K = 0.1 with the 0.1 per cent. point. 

In spite of the difference in principle between my tests and those 
based on the P integrals, and the omission of the latter to give the 
increase of the critical values for large n, dictated essentially by the 
fact that in testing a small departure found from a large number of 
observations we are selecting a value out of a long range and should 
allow for selection, it appears that there is not much difference in the 
practical recommendations. Users of these tests speak of the 5 per cent. 
point in much the same way as I should speak of the K = $10^{-1/2}$ point, 
and of the 1 per cent. point as I should speak of the K = $10^{-1}$ point; and 
for moderate numbers of observations the points are not very different. 
At large numbers of observations there is a difference, since the tests 
based on the P integral would sometimes assert significance at departures 
that would actually give K > 1. Thus there may be opposite decisions 
in such cases. But they will be very rare. We may recall that P = 0.01 
means that if q is true there is a 1 per cent. chance of a larger departure. 
Hence we can apply Bernoulli’s theorem and say that if we assert a 
genuine departure whenever P is less than 0.01 we shall expect to be 
wrong in the long run in 1 per cent. of the cases where q is true. According 
to my theory we should expect to make fewer mistakes by taking 
the limit further out; when K = 1 lies above P = 0.01 there will be a 
smaller risk of rejecting q wrongly, partly counter-balanced by a slight 
increase in the risk of missing a small genuine departure. But in these 
conditions the probability of a mistake by the use of the 1 per cent. 
limit for P is so small anyhow that there is little to be gained by reducing 
it further. Values between the two limits will be so rare that differences 
in practice will hardly ever arise. Thus even though the P tests sometimes 
theoretically assert q' when the number of observations is large 
and my tests support q, the occasions will be extremely rare. 

Actually it may appear that such differences are fairly common; it 
is known that when the number of observations is very large the 
estimates of new parameters two to four times the standard error tend 
to be commoner than would be expected if q was true, but that these 





often or usually do not persist in other similar sets of observations. 
This, however, is a false contrast, because these discrepancies do not 
correspond to either the q or to the q’ of the tests considered in these 
tables; they represent internal correlation of the errors or non-indepen- 
dence of the chances, and we have not arrived at the hypothesis actually 
supported by the data until this hypothesis also has been set up and 
considered. But this leads us to a working rule for saying when such a 
hypothesis is worth investigation; if an estimate gives K > 1 and P < 0.01, 
internal correlation should be suspected and tested, for such a result 
would not be expected on the hypothesis of independence of the errors in 
either case. The use of P by itself involves a danger that discrepancies 
due to failure of independence will be interpreted as systematic. 
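The working rule can be written as a simple check for the one-parameter χ² case. The sketch assumes the asymptotic form K = (2n/π)^{1/2} exp(-χ²/2) and takes P as the chance of exceeding χ² with one degree of freedom, computed as erfc((χ²/2)^{1/2}); the function names and the default limit of 0.01 are illustrative choices.

```python
import math


def K_one_parameter(n, chi2):
    """Asymptotic K for one new parameter: sqrt(2n/pi) * exp(-chi2/2)."""
    return math.sqrt(2.0 * n / math.pi) * math.exp(-chi2 / 2.0)


def P_one_parameter(chi2):
    """Chance, on the null hypothesis, of exceeding chi2 with one degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def suspect_internal_correlation(n, chi2, p_limit=0.01):
    """True when K > 1 and P < p_limit, the case the working rule singles out."""
    return K_one_parameter(n, chi2) > 1.0 and P_one_parameter(chi2) < p_limit


if __name__ == "__main__":
    for n, chi2 in ((100, 8.0), (100_000, 8.0), (100_000, 3.0)):
        print(n, chi2,
              round(K_one_parameter(n, chi2), 3),
              round(P_one_parameter(chi2), 4),
              suspect_internal_correlation(n, chi2))
```

With χ² = 8 the flag is raised for 100,000 observations but not for 100, since only in the former case does K exceed 1 while P is below 0.01.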


Table I. Values of $\chi^2$ from $K = (2n/\pi)^{1/2} \exp(-\tfrac{1}{2}\chi^2)$

       n      K=1  10^-1/2    10^-1  10^-3/2    10^-2
       5      1.2      3.5      5.8      8.1     10.4
       6      1.3      3.6      6.0      8.2     10.6
       7      1.5      3.8      6.1      8.4     10.7
       8      1.6      3.9      6.2      8.5     10.8
       9      1.7      4.0      6.3      8.6     11.0
      10      1.8      4.2      6.5      8.8     11.1
      11      2.0      4.2      6.6      8.9     11.2
      12      2.0      4.3      6.6      8.9     11.2
      13      2.1      4.4      6.7      9.0     11.3
      14      2.2      4.5      6.8      9.1     11.4
      15      2.3      4.6      6.9      9.2     11.5
      16      2.3      4.6      6.9      9.2     11.5
      17      2.4      4.7      7.0      9.3     11.6
      18      2.4      4.7      7.0      9.4     11.6
      19      2.5      4.8      7.1      9.4     11.7
      20      2.5      4.8      7.2      9.4     11.8
      30      3.0      5.2      7.6      9.9     12.2
      40      3.2      5.5      7.8     10.2     12.4
      50      3.5      5.8      8.1     10.4     12.7
      60      3.6      5.9      8.2     10.6     12.8
      70      3.8      6.1      8.4     10.7     13.0
      80      3.9      6.2      8.5     10.8     13.1
      90      4.0      6.4      8.7     11.0     13.3
     100      4.2      6.4      8.8     11.1     13.4
     200      4.8      7.2      9.5     11.8     14.1
     500      5.8      8.1     10.4     12.7     15.0
   1,000      6.5      8.8     11.1     13.4     15.7
   2,000      7.2      9.4     11.8     14.1     16.4
   5,000      8.1     10.4     12.7     15.0     17.3
  10,000      8.8     11.1     13.4     15.7     18.0
  20,000      9.4     11.8     14.1     16.4     18.7
  50,000     10.4     12.7     15.0     17.3     19.6
 100,000     11.1     13.4     15.7     18.0     20.3
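Entries of Table I can be checked by inverting the formula for χ². The sketch below assumes the form K = (2n/π)^{1/2} exp(−χ²/2), so that χ² = ln(2n/π) − 2 ln K; the function name `chi2_for_k` is illustrative only. Printed entries may differ from the computed ones by a unit in the last place.

```python
import math

def chi2_for_k(n, k):
    """Solve K = sqrt(2n/pi) * exp(-chi2/2) for chi2, given n and K."""
    return math.log(2.0 * n / math.pi) - 2.0 * math.log(k)

# Columns of the table correspond to K = 1, 10^-1/2, 10^-1, 10^-3/2, 10^-2.
for n in (5, 100, 100_000):
    row = [round(chi2_for_k(n, 10.0 ** (-j / 2)), 1) for j in range(5)]
    print(n, row)
```

Because ln 10 = 2 ln 10^{1/2}, multiplying n by 10 has the same effect as moving one column to the right, which is why the rows for n, 10n, 100n repeat the same figures shifted along.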






Table II . from K = 


       n      K=1  10^-1/2    10^-1  10^-3/2    10^-2
       7      4.3      7.1       ..       ..       ..
       8      4.5      7.3       ..       ..       ..
       9      4.6      7.4       ..       ..       ..
      10      4.8      7.5       ..       ..       ..
      11      4.9      7.6       ..       ..       ..
      12      5.0      7.7     10.3       ..       ..
      13      5.1      7.8     10.4       ..       ..
      14      5.2      7.9     10.4       ..       ..
      15      5.3      8.0     10.5       ..       ..
      16      5.4      8.1     10.6       ..       ..
      17      5.4      8.2     10.7       ..       ..
      18      5.5      8.2     10.8       ..       ..
      19      5.6      8.2     10.8       ..       ..
      20      5.6      8.3     10.9     13.4     15.9
      30      6.1      8.8     11.3     13.8     16.3
      40      6.5      9.1     11.7     14.2     16.6
      50      6.7      9.4     11.9     14.4     16.8
      60      6.9      9.6     12.0     14.6     17.0
      70      7.0      9.7     12.2     14.7     17.1
      80      7.2      9.8     12.3     14.8     17.3
      90      7.3     10.0     12.5     15.0     17.4
     100      7.5     10.1     12.6     15.1     17.6
     200      8.3     10.9     13.4     15.9     18.3
     500      9.3     11.9     14.4     16.8     19.3
   1,000     10.1     12.6     15.1     17.6     20.0
   2,000     10.9     13.4     15.9     18.3     20.7
   5,000     11.9     14.4     16.8     19.3     21.7
  10,000     12.6     15.1     17.6     20.0     22.4
  20,000     13.4     15.9     18.3     20.7     23.2
  50,000     14.4     16.8     19.3     21.7     24.1
 100,000     15.1     17.6     20.0     22.4     24.8





Table III. $t^2$ from $K = (\pi\nu/2)^{1/2} (1 + t^2/\nu)^{-\frac{1}{2}\nu + \frac{1}{2}}$

       ν      K=1  10^-1/2    10^-1  10^-3/2    10^-2
       5      3.4      9.9       ..       ..       ..
       6      3.4      8.9     17.6       ..       ..
       7      3.4      8.3     15.5       ..       ..
       8      3.5      8.0     14.2       ..       ..
       9      3.5      7.7     13.3       ..       ..
      10      3.6      7.5     12.7     19.2     27.8
      11      3.6      7.4     12.2     18.2     25.8
      12      3.7      7.3     11.8     17.4     24.2
      13      3.7      7.2     11.4     16.8     23.3
      14      3.7      7.2     11.2     16.3     22.4
      15      3.8      7.1     11.1     15.9     21.6
      16      3.8      7.1     11.0     15.4       ..
      17      3.9      7.1       ..     15.1     20.1
      18      3.9      7.0     10.8     14.8     19.6
      19      3.9      7.0     10.7     14.6     19.2
      20      4.0      7.0     10.6     14.5     18.9
      50      4.6      7.4       ..     12.8       ..
     100      5.2      7.7     10.3     12.8     15.5
     200      5.7      8.2     10.7     13.1     15.6
     500      6.8      9.1     11.4     13.8     16.2
   1,000      7.4      9.7       ..     14.3     16.6
   2,000      8.1     10.4     12.7     15.0     17.3
   5,000      9.0     11.3     13.6     15.9     18.2
  10,000      9.7     12.0     14.3     16.6     18.9
  20,000       ..     12.7     15.0     17.3     19.6
  50,000     11.3     13.6     15.9     18.2       ..
 100,000     12.0     14.3     16.6     18.9     21.2
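The relation behind Table III can likewise be inverted for t². The sketch below assumes the caption formula is K = (πν/2)^{1/2} (1 + t²/ν)^{−(ν−1)/2}; the function name `t2_for_k` is illustrative only. Computed values agree with the tabulated ones to about a unit in the last place for most entries, but some printed entries differ by a few tenths.

```python
import math

def t2_for_k(nu, k):
    """Solve K = sqrt(pi*nu/2) * (1 + t2/nu)**(-(nu - 1)/2) for t2 (needs nu > 1)."""
    base = math.sqrt(math.pi * nu / 2.0) / k
    return nu * (base ** (2.0 / (nu - 1.0)) - 1.0)

# Columns correspond to K = 1, 10^-1/2, 10^-1, 10^-3/2, 10^-2.
for nu in (6, 10, 100, 100_000):
    print(nu, [round(t2_for_k(nu, 10.0 ** (-j / 2)), 1) for j in range(5)])
```

For large ν the factor (1 + t²/ν)^{−(ν−1)/2} approaches exp(−t²/2), so t² tends to ln(πν/2) − 2 ln K and the lower rows advance by equal steps from column to column.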


Table IIIa. $t^2$ from accurate formula 5.2 (33)


V 

« = 0 

K 

1 

1 

K=l 

1 

10 -* 

: 

10-1 

10 -* 

1 

2-3 

3-9 

30 

1 - 2 x 10 * ! 

2 x 10 ** 

1 

2 

2-7 

3-6 

22 

1 102 

10 * 

j 10 * 

3 

3-0 

3-4 

12-8 

39 1 

120 

370 

4 

3-3 

3-4 

10-6 

26-8 

1 

1 118 

5 

3-6 

3-6 

9-2 

19-4 

37 

1 66 

6 

3-8 

3-6 

8-6 

16-0 i 

1 29 

1 60 

7 

4-0 

3-6 

8-1 

15-0 

1 24-2 

1 38 

8 

4-2 

3-6 

7-9 

13-6 

[ 20-6 

1 31 

9 { 

4-3 j 

3-8 

7-7 

' 13-1 

' 19-5 

29-0 









Table IV. 


from K — 




       ν      K=1  10^-1/2    10^-1  10^-3/2    10^-2
       5      7.3     18.4       ..       ..       ..
       6      7.0     15.9       ..       ..       ..
       7      6.8     14.4       ..       ..       ..
       8      6.7     13.1     22.5     36.0     52.2
       9      6.7     12.8     20.8     31.3     45.3
      10      6.7     12.3     19.4     28.4     40.0
      11      6.7     12.0     18.5     26.5     36.7
      12      6.7     11.7     17.7     25.0     34.0
      13      6.7     11.5     17.2     24.0     32.2
      14      6.7     11.3     16.7     23.1     30.6
      15      6.7     11.1     16.3     22.3     29.3
      16      6.7     11.0     15.9     21.6     28.1
      17      6.8     10.9     15.6     21.0     27.2
      18      6.8     10.8     15.3     20.5     26.5
      19      6.8     10.7     15.1     20.2     25.9
      20      6.8     10.7     15.0     19.9     25.3
      50      7.3     10.4     13.6     16.9     20.3
     100      7.9     10.8     13.6     16.4     19.3
     200      8.5     11.2     13.9     16.5     19.2
     500      9.4     12.0     14.6     17.2     19.7
   1,000     10.2     12.8     15.2     17.7     20.2
   2,000     10.9     13.4     15.9     18.3     20.8
   5,000     11.9     14.4     16.8     19.3     21.7
  10,000     12.7     15.1     17.6     20.0     22.4
  20,000     13.4     15.9     18.3     20.8     23.2
  50,000     14.4     16.9     19.3     21.7     24.1
 100,000     15.1     17.6     20.0     22.4     24.8


Table IVa. $t^2$ from accurate formula 5.21 (42)


V 

t = 0 

K 

Ik=. 

10-1 

io -‘ 

1 

10-8 

10 -* 

1 

2-7 

91 

1,500 1 

lO'o 



2 

3-0 

6-8 

48 

380 

3,300 

32,000 

3 

3-3 

6-5 

24-5 

79 

251 

790 

4 

3-5 

6-2 

18-2 

43-6 

100 

216 

6 

3-8 

6-1 

16-7 

33-6 

70 

138 

6 

40 

60 

13-9 

26-6 

49 

85 

7 

4-2 

6-9 

12-8 

22-2 

36 

55 

8 

4-3 

6-9 

12-3 

20-7 

32-6 

49-1 





Table V. z from 5.43 (11) and (14)


V 

z = 0 

K 

R = 1 

i(H 

10 -' 

lO-i 

10 -* 

K = 1 

10-i 

10 -* 

10-J 

10 -“ 

1 

1-8 

+ 0-77 

- 1 - 1'04 

41-20 

+ 1-31 

+ 1-40 

— 1*4 

- 6-5 i 

y 

? 

? 

2 

2'2 

+ 0-56 

+ 0-76 

-|~ 0-94 

4 1-04 

4 1-12 

— 1-13 

2*2 

- 3-2 

- 4-4 , 

- 5-5 

3 

2-5 

+ 0-47 

-+ 0-70 

+ 0-78 

+ 0-86 

+ 0-94 

-099 

- 1-08 

- 2-30 

- 2-88 

- 3-46 

4 

2-8 

-f 0-45 

+ 0 - fi2 

f 0-72 ; 

40-82 

40-80 

^sSdI 

- 1-17 



- 2-42 

5 

3-1 

-t 0-43 

-^- o•^)7 

f 0-07 ' 

+ 0-75 

+ 0-82 

- 0-05 

— 1-04 

- 1-38 

- 1-71 

- 2-03 

6 

3-4 

4- 0-41 

- I - 0-54 

+ 0-63 ! 

4 0-70 

+ 0-77 

- 0-61 

- 0-94 

- 1-21 

- 1-47 

- 1-72 

7 1 

3-6 

+ 0-39 

+ 0-51 

+ 0-60 

+ 0-85 

+ 0-73 

- 0-57 1 

- 0-85 

~ 108 

- 1-30 

- 1-51 

8 

3-9 

^- 0-37 

+ 0-49 

+ 0-57 

4 0-63 

- 1 ^ 0-09 

- 0-52 

- 0-77 

- 0-98 

- 1-18 

- 1-30 

9 

41 

- 1 - 0-38 

4 0-46 

-f-o-r)4 

+ 0-60 

, -1 0-60 

- 0-49 

- 0-71 

- 0-90 

- 1*08 

- 1-25 

10 

4-2 

+ 0-34 

4 - 0-44 

+ 0-52 

4 0-58 

40-63 

- 0-47 

- 0-67 

— 0-85 

- 101 

- 1-18 

12 

4-6 

f 0-32 

4 - 0-42 

-{- 0-49 

f 0-54 

fO-59 

- 0-43 

- o-oo 

- 0-76 

— 0-89 

— 102 

14 

4*9 

+ 0-31 

+ 0-39 

+ 0-46 

+ 0-51 

-1 0-55 

- 0-40 

- 0-55 

- 0-68 

-081 

- 0-92 

16 

6-2 

+ 0-30 

40-37 

40-43 

40-48 j 

j M 0-52 

- 0-38 

- 0-51 

-003 

- 0-74 

-085 

18 

5-6 

+ 0-29 

+ 0-36 

-f 0-41 

40-46 

- 0-50 

- 0-36 

- 0-48 

- 0-59 

- 0-71 

- 0-78 

20 

6-8 

- hO-27 

+ 0-34 

40-40 

-f 0-44 I 

-, 0-48 

- 0-34 

i - 0-45 

- 0-55 

- 0-68 

- 0-73 

50 

90 

+ 0-20 

40-24 

40-27 

i 40-30 

! {- 0-33 

- 0-22 

- 0-29 

- 0-34 

- 0-38 

- 0-43 

1 




NOTE ON THE CONSISTENCY OF THE PRODUCT RULE 


We assume weaker forms of Axioms 1, 2, 3, 4, 5, 6, namely that they
hold on a sufficiently general datum H. Any actual datum is supposed
to contain H. We use Conventions 1 and 2 on H and assume that
Convention 3 is applicable on H. Then Theorems 1, 2, 3, 4, 5, 6, 7, 8
follow if the datum is H.

Now if p is an additional datum such that $P(p \mid H) \neq 0$, and $q_i$ are
a set of propositions, exhaustive on H, whose disjunction is Q, we
assume

$$P(q_i \mid pH) = \frac{P(pq_i \mid H)}{P(p \mid H)}. \tag{1}$$


This provides the first means, in this presentation, of calculating
probabilities when H is not the only datum. Convention 1 on pH
becomes a rule for the ordering of probabilities in terms of their
numerical assessments instead of conversely.

Since the $P(pq_i \mid H)$ satisfy Ax. 1 and $P(p \mid H)$ is independent of $q_i$,
it follows that the $P(q_i \mid pH)$ satisfy Ax. 1. Similarly they satisfy Ax. 2, 4,
Conv. 2 (since if $q_i$, $q_j$ are exclusive on H they are also exclusive on pH;
and if $q_i$, $q_j$ are exclusive on pH, $pq_i$, $pq_j$ are exclusive on H), and Ax. 5.

For Ax. 6, we have, if $pq_i$ entails $r_k$,

$$P(q_i r_k \mid pH) = \frac{P(pq_i r_k \mid H)}{P(p \mid H)} = \frac{P(pq_i \mid H)}{P(p \mid H)} = P(q_i \mid pH),$$

using Ax. 6 on data H; hence Ax. 6 holds on data pH.

Next, if pH entails $q_i$, $P(pq_i \mid H) = P(p \mid H)$ by Ax. 6, and therefore
$P(q_i \mid pH) = 1$. Conv. 3 becomes a theorem, and the first part of Ax. 3
follows. If pH entails $\sim q_i$, $pq_i$ is impossible given H and therefore
$P(pq_i \mid H) = 0$, $P(q_i \mid pH) = 0$; hence we have the second part of Ax. 3.

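For Ax. 7, consider two sets of propositions each exhaustive on H,
say $q_i$, $r_k$; then Ax. 7 will read

$$P(q_i r_k \mid pH) = P(q_i \mid pH)\, P(r_k \mid q_i pH) / P(p \mid q_i pH). \tag{2}$$

By (1) this is equivalent to

$$\frac{P(pq_i r_k \mid H)}{P(p \mid H)} = \frac{P(pq_i \mid H)}{P(p \mid H)} \cdot \frac{P(pq_i r_k \mid H)}{P(pq_i \mid H)} \bigg/ \frac{P(pq_i p \mid H)}{P(pq_i \mid H)},$$

which is an identity. Hence Ax. 7 follows.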
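The identity can be checked on a small finite model. The sketch below is an illustration under assumptions that are not in the text: the datum H is represented by twelve equally probable cases, propositions by sets of cases, and probabilities on any other datum are computed only through (1). All names (`CASES`, `prob_on_H`, `prob_given`, `ax7_holds`) are hypothetical.

```python
from fractions import Fraction
from itertools import product

CASES = frozenset(range(12))   # hypothetical equally probable cases on H

def prob_on_H(prop):
    """P(prop | H): fraction of the cases on H for which prop holds."""
    return Fraction(len(prop & CASES), len(CASES))

def prob_given(prop, datum):
    """Equation (1): P(prop | datum.H) = P(datum.prop | H) / P(datum | H)."""
    return prob_on_H(datum & prop) / prob_on_H(datum)

p = frozenset({0, 1, 2, 3, 4, 5, 6})   # the additional datum p
# Two sets of propositions, each exhaustive and exclusive on H.
qs = [frozenset(range(0, 4)), frozenset(range(4, 9)), frozenset(range(9, 12))]
rs = [frozenset(c for c in CASES if c % 2 == 0),
      frozenset(c for c in CASES if c % 2 == 1)]

def ax7_holds(q, r):
    """Check (2): P(qr | pH) = P(q | pH) P(r | qpH) / P(p | qpH)."""
    pq = p & q
    if not pq:   # (1) cannot be applied when P(pq | H) = 0
        return True
    lhs = prob_given(q & r, p)
    rhs = prob_given(q, p) * prob_given(r, pq) / prob_given(p, pq)
    return lhs == rhs

print(all(ax7_holds(q, r) for q, r in product(qs, rs)))
print(sum(prob_given(q, p) for q in qs))
```

Exact fractions are used so that the comparison in `ax7_holds` is an equality of rational numbers and not a floating-point approximation.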

In this presentation we assume no properties of probabilities on data 
other than H, except that they can be calculated by (1), and this is 
possible if the axioms are satisfied by probabilities on H. Hence if pure
mathematics and the axioms on H are consistent, the axioms remain
consistent when applied to probabilities on data including H. 

An apparent difficulty about this argument as a general proof of
consistency is that if H is the general principles of the theory and p a
special proposition, we may not be able to use Conv. 3 on data H. This
can be met in two ways. We have seen that the principle of inverse
probability is consistent if the product rule is consistent for likelihoods,
and therefore it is enough if H in the argument includes a law
such that we can use Conv. 3; but this is always true for likelihoods.
The other way is to notice that if H, for instance, expresses ignorance
of a standard error, we may arbitrarily impose bounds on the possible
values so that Conv. 3 can be used; and our results will be consistent
as limits of the results when the lower bound tends to 0 and the upper
bound to infinity, and infinite integrals are interpreted in this way in
any case. This way of looking at the matter may be preferred. For if H'
is such that we can use Conv. 3 on it, and H differs from H' only by
including the statement that a standard error is unknown, then all
non-zero probabilities on H' are replaced by infinite ones on H; a
statement that we do not know a standard error is apparently
accompanied by an instruction to forget for a time everything that we
ever knew.



NOTE ON THE INFINITE REGRESS ARGUMENT 

The customary procedure in a mathematical system is to state a set of definitions 
and postulates and to examine what consequences follow from them. It is often 
said that all concepts should be defined and all postulates should be proved. 
It is worth while to point out that to admit this would invalidate any argument. 
Suppose that a system starts from concepts $A_1, A_2, \ldots$ and postulates
$p_1, p_2, \ldots$, and that we are required to define $A_1$. We may be able
(1) to define it in terms of $A_2, A_3, \ldots$, or (2) to define it in terms
of a concept $A_1'$ not included in $A_2, A_3, \ldots$.
If (1) is possible the number of fundamental concepts is reduced; but repetition
of the process for $A_2$ reproduces the same situation. Suppose then that we find
a set $B_i$, none of which can be defined in terms of the others, and are asked
to define $B_1$. The definition must be in terms of a further concept $C_1$, which
would therefore have to be defined in terms of $D_1$, and so on for ever. Hence
we can never define all the concepts of a system.

Similarly to prove $p_1$ would require a proof from $p_2, p_3, \ldots$ or the
introduction of a new postulate, and again we should always find at some stage
that the proof of a postulate requires the introduction of a new one.

An argument, the application of which would always lead to the introduction 
of a new definition or postulate not within the system, is said to involve an 
infinite regress. Several arguments in the text are designed to avoid infinite
regresses (pp. 112, 116, 375), but the principle is not stated in general terms. 

A famous example is Lewis Carroll's† 'What the Tortoise said to Achilles'.
The propositions p and p implies q imply q. But if we accept p and p implies q
we cannot symbolize a proof that we can assert q by itself. If we try we find
ourselves in an infinite regress. The use of 'therefore' can be stated, understood,
and acted on only verbally, not symbolically.

† Complete Works, 1225-30; Mind, 4, 1895, 278-80.



INDEX 


Abbreviations x, dx, 120. 

Accidents, factory, 69, 295. 

Accuracy, useful degree of, 126, 397. 
Adams, J. C., 389. 

Addition rule, 19, 30, 33. 

Agreement, too close, 281. 

Agricultural experiments, 127, 214, 361.
Aitken, John, 60. 

Alternative hypothesis, 220. 

Amoeba, 6. 

Ancillary statistics, 182. 

Applicability, 8, 9, 11. 

Approximations, 50, 140, 168, 170, 251. 

A priori, 8, 29. 

Argon, 260. 

Arithmetic mean, 84, 92, 107, 189. 

Assent, universal, 14, 46. 

Atmospheric tide, 307. 

Average residual, 188. 

Barnes, E. W., 395. 

Bartlett, M. S., 41, 53, 147.

Bateman, H., 69. 

Bayes, T., 29, 30, 34, 42, 102, 107, 109, 374. 
Behaviourism, 45, 379. 

Belief, see Confidence. 

Bellamy, Miss E. F., 324. 

Benefit, expectation of, 30. 

Bernoulli, Daniel, 32. 

Bernoulli, James, 52. 

Bias, 143, 177, 231. 

Binomial law, 50, 66. 

Binomial, negative, 68, 77. 

Black, A. N., 174. 

Boltzmann, 28, 369. 

Bond, W. N., 288.

Bortkiewicz, L. von, 59. 

Boys, C. V., 280. 

Broad, C. D., 6, 26, 111, 112, 115, 372. 
Brown, E. W., 362. 

Brunt, D., 211, 269. 

Bullard, E. C., 84, 129, 137. 

Bullen, K. E., 176.

Burnside, W., 345. 

Campbell, N. R., 6, 14, 41.

Cantelli, F. P., 55. 

Carnap, R., 20.

Carroll, Lewis, 45, 220, 306, 407. 

Cauchy rule, 78, 81, 170, 189, 244. 
Causality, 12, 108.

Caution, 273, 287, 381. 

Central limit theorem, 79. 

Certainty, approach to, 38, 336. 

on the data, 17. 

Chance, 41, 50, 229. 
continuous distribution, 301. 
games of, 32, 47. 

Chapman, S., 307. 

Characteristic function, 73. 

Chauvenet, 188, 201, 357. 


Checking, 139. 

Chesterton, G. K., 395.

Combination of estimates, 176.
of tests, 306. 

Common sense, 1. 

Comparison of chances, 235. 

Comrie, L. J., 62.

Confidence, reasonable degree of, 16. 

Conjunction, 18. 

Consistency, 8, 19, 35, 36, 159, 166, 170, 
251, 406. 

Continental drift, 48. 

Contingency, 211, 232. 
diagonal elements, 332. 

Continuity, 21, 24. 

Continuous variation, 227. 

Conventions, 20, 30. 

Correlation, 71, 152, 263. 
correction of, 202. 
internal, 271, 287, 289, 400. 
intraclass, 72, 198, 268, 276, 314. 
partial, 328. 
rank, 204, 268. 
serial, 170, 227, 328. 

Cournot, A., 374. 

Cramér, H., 80, 343.

Critical realism and idealism, 46. 

Curvature of universe, 304.

Damoiseau, 389. 

Darwin, Sir G. H., 390. 

Data, need to state, 16, 27, 350, 377. 

Deduction, 1, 3, 17. 
as approximation, 336. 

Definitions, 379. 

Degrees of freedom, 89, 128.

Delaunay, 389.

De Moivre, A., 52, 342. 

Density, probability, 24. 

Design of experiments, 97, 214, 361. 

Determinism, 11. 

Deviation, standard, 92, 128. 

Diananda, P. H., 60, 169. 

Dice, 50, 231, 306, 314. 

Digamma function, 187. 

Dingle, H., 14, 384. 

Dip, magnetic, 84. 

Dirichlet integrals, 87, 116. 

Disjunction, 18. 
separation of, 41. 

Dodgson, C. L., see Carroll, Lewis. 

Dust counter, 60, 241. 

Earthquakes, aftershocks, 325, 334. 
determination of epicentres, 136. 
identity of epicentres, 322. 
law of error, 190. 
periodicity, 324. 
travel times : 

P, 176, 202, 273, 299. 

S and SKS, 265. 



INDEX 


409 


Economy of thought, 4. 

of postulates, 9, 37, 46, 102, 345, 384. 
Eddington, Sir A. S., 6, 193, 283, 379, 
380. 

Edgeworth, F. Y., 108. 

Efficiency, 145, 179. 

Einstein, A., 362, 386. 

Ellis, R. L., 345, 374. 

Emmett, W. G., 365. 

Ensemble, 11, 341. 

Entailment, 17, 48.

Epistemology, 1, 12, 13. 

Equations of condition, 133. 

normal, 133. 

Ergodic theory, 371. 

Errors, 12, 13. 
accidental, 270. 
composition of, 74, 79. 
independence of, 286; see also correlation, internal.
normal law, 60, 287.
probable, 62, 124.
standard, 62.
systematic, 270, 273.
unknown law of, 187.

See also Pearson types.

Estimation, 99.

relation to significance, 359.

Euclid, 8, 40. 

Exclusive, 18. 

Exhaustive, 18. 

Expectation, mathematical, 31, 43, 177, 
364. 

moral, 31.
of benefit, 31, 43. 

Explanation, 303, 390. 

Factorial function, 51, 239. 

Factory accidents, 69, 295. 

Fiducial argument, 352. 

Fieller, E. C., 55.

Fisher, R. A., 11, 29, 63, 88, 92, 96, 124, 
127, 152, 169, 179, 181, 184, 189, 197, 
209, 214, 229, 238, 282, 306, 341, 352, 
356, 364. 

Fowler, Sir R. H., 370.

Franks, W. S., 211.

Fréchet, M., 371.

Freedom, degrees of, 89, 128. 

Freeman, R. A., 100, 392. 

Frequency definitions, 11, 34, 342, 372. 
Freud, S., 239. 

Functions, new, 295. 
old, 299. 

Galton, Sir F., 72.

Gases, kinetic theory, 28, 369. 

Gauss, C. F., 14, 62, 84, 103, 133, 190. 
Geiger, H., 59. 

Gender, 239. 

Generalization (empirical propositions), 1, 3.

(logical propositions), 7, 26. 

George, W. H., 12. 

Gibbs, Willard, 11, 341, 369. 


Glaisher, 390. 

Gödel, K., 35, 56.

Gosset, W. L., see ‘Student’. 

Grades, 206, 210. 

Gravitation, law of, 362. 

constant of, 280. 

Gravity, 129, 137, 198. 

Greenwood, M., 69. 

Grouping, 136, 184, 193, 326. 

H (definition), 48. 

Haldane, J. B. S., 107, 120, 162.
Heisenberg, H., 14. 

Heyl, P. R., 280. 

Hilbert, 10. 

Hill, G. W., 362. 

Horse, kicks by, 59, 71, 295. 

Hosiasson, Miss J., 342. 

Hulme, H. R., 288. 

$I_m$ defined, 158.

Idealism, 49, 393. 

Ignorance, 34, 101, 220, 222, 353. 
Implication, 17, 48. 

Impossibility, 17. 

Induction, 1, 8. 

Infinite population, 11, 341, 345. 

Infinite regress, 112, 116, 375, 407. 
Inoculation, 239, 312. 

Insufficient reason, 34. 
Integer, unknown, 213.

Internal correlation, 271, 287, 289, 400.
Intuition, 15. 

Invariance, 104, 158, 170, 248. 

Inverse probability, 29, 35, 372. 
Irrelevance, 28, 41, 42, 160, 163. 

J defined, 158. 

Jeans, Sir J. H., 370. 

Johnson, W. E., 19, 26, 118, 372. 

Joint assertion, 18. 

Jolly, H. L. P., 137. 

Jones, Sir H. Spencer, 278. 

Jourdain, P. E. B., 38. 

K defined, 221. 

tables, 396. 

Kapteyn, 269. 

Kendall, M. G., 49, 81, 88, 96, 210, 239,
281, 326, 372. 

Keynes, Lord, 26, 59, 147. 

Knott, C. G., 326. 

Knowledge, vague, 107, 121, 152, 219, 226, 
306, 377. 

Lagrange, J. L., 376. 

Lange, J., 238, 356. 

Language, 19, 20, 32, 46, 372, 378, 391. 
Laplace, P. S., 14, 23, 29, 31, 34, 62, 102, 
107, 133, 374, 389. 
rule of succession, 110.

Latin square, 215. 

Law of large numbers, 52. 

Law, scientific, 3, 13, 99, 113, 220, 336, 
349. 



410 


INDEX 


Least squares, 129. 

approximations, 140, 173. 

Le Verrier, U. J. J., 389. 

Likelihood, 29, 47, 99. 

maximum, 168, 170, 189. 

Limit of sampling ratio, 63, 341, 345. 
Littlewood, J. E., 56, 76. 

Location parameter, 63. 

Logical product, 18, 25. 
quotient, 25. 
sum, 18, 25. 

Lüders, R., 70.

McColl, H., 26.

Materialism, 394. 

Mathematics, pure, 2, 10, 37. 
applied, 2, 3, 12. 

Maximum likelihood, 168, 170, 189.

relation to invariance theory, 169. 
Maxwell, J. C., 1, 369. 

Mean square contingency, 212. 
deviation, 92. 

Measures, significance tests, 242, 251, 315. 
Median law, 76, 78, 188. 

Median, use of, 187, 293. 

of general law, 148. 

Mendelism, 108, 282, 311, 360. 

Mercury, perihelion of, 387. 

Metaphysics, 394. 

Method and material, 7, 9, 10, 388. 

Milne, E. A., 6. 

Milne-Thomson, L. M., 62. 

Mind, human, 5, 9, 37, 107, 377, 392. 
Mises, R., 341, 346.

Moments, 73, 74, 76, 183. 

Moon, secular acceleration of, 389. 

Moore, G. E., 17. 

Muirhead, J. H., 14. 

Multinomial law, 57, 90. 

Multiple sampling, 57. 

Multiplicative axiom, 10, 56. 

Naive realism and idealism, 46, 383. 
Negative binomial, 68, 77, 293. 

Newall, H. F., 387. 

Newbold, Miss E. M., 295. 

Newman, M. H. A., 213. 

Newton, 40, 340, 362. 

Neyman, J., 172, 177, 341, 343, 366. 
Nitrogen, density of, 260. 

Normal equations, 133. 

Normal law, derivation, 60, 79. 
departure from, 190.
estimation problem, 120. 
moments, 78. 
reproductive property, 79. 
significance tests for parameters, 242, 
251. 

test of, 287. 

Nutation, 278. 

Null hypothesis, 229. 

Numbers, introduction of, 19.

Objectivity, 11, 376.

Observations, rejection of, 188.


Ockham, 315, 385. 

Offord, A. C., 160. 

P integral, 355, 398. 

Pairman, Miss E., 187. 

Paneth, F. A., 263. 

Parallax, negative, 142, 204. 
stellar, 300. 

Parameters, number admissible, 100, 315. 
location and scale, 63. 
old and new, 222. 
orthogonal, 184, 223. 
suggested values, 109. 

Pearson, E. S., 177, 219, 271, 366. 
Pearson, Karl, 7, 45, 62, 72, 88, 108, 115, 
125, 172, 183, 204, 231, 270, 288, 310, 
354, 374. 

Pearson types, 64, 185.

Peirce, C. S., 188. 

Periodicity, 315. 

Perks, W., 170. 

Personal equation, 270. 

Petersburg problem, 32. 

Physicists, old-fashioned, 244, 274.

Plana, 389. 

Poisson law, 58, 68, 77, 119, 237, 240, 293. 
Ponce, John, 315. 

Pontécoulant, 389.

Postulates, economy of, 9, 37, 46, 102. 
Precision constant, 62. 

Prediction, 1, 4, 13, 14, 40.

Principia Mathematica, 6, 8, 10, 18, 25, 48,
391. 

Probability, 15. 
aim of theory, 8. 
density, 24. 
posterior, 29. 

prior, 29, 34.

invariance rules: 
estimation, 158. 
significance, 248. 
logarithmic rule, 102, 104, 119. 
of laws, 100. 
revision of, 310. 
truncation of, 142, 197, 203. 
uniform rule, 102. 

Probable error, 62, 124. 

Product rule, 25. 
consistency of, 35, 36, 405. 
incorrect form of, 27. 

Psychoanalysis, 239. 

Psychology, 37, 38.

Quantum theory, 100, 382, 387. 
Questions, statement of, 91, 108. 

Quinney, H., 378.

Quotient, logical, 26. 

Radioactivity, 69, 71, 241. 

Ramsey, F. P., 10, 26, 31, 372.
Randomization, 214, 272, 297. 
Randomness, 49. 

rule of procedure, 315, 386. 

Rank correlation, 204. 

Rayleigh, Lord, 260. 





Reading of scale, 146. 

Realism, 44, 393.

Reality, 338. 

Rectangular law, 66, 86, 143, 184. 
Reduction, uniform, 102. 

Regression, 73. 

Rejection of observations, 188, 280, 287. 

of unobservables, 383, 387. 

Relativity, 30, 48, 386. 

Religion, 394. 

Re-scaling of law, 146. 

Residuals, 133, 188. 

Rounding-off errors, 86, 146, 196.

Russell, Bertrand, 6, 46, 380; see also
Principia Mathematica.

Rutherford, Lord, 59. 

Sadler, D. H., 362. 

Samples, comparison of, 235.

Sampling, simple, 49, 66, 109.
multiple, 57, 117.
with replacement, 50.

Scale parameter, 63. 

Scale, reading of, 146. 

Schuster, Sir A., 227, 326. 

Scrase, F. J., 60.

Seidel, 140, 173. 

Selection, allowance for, 226. 

Sheppard, W. F., 62, 195. 

Significance, 100, 220.
approximate form, 251. 
combination of tests, 305. 
complications, 222. 
invariance, 248. 

Simplicity, 4, 100, 103, 113, 222, 391. 
Smithies, F., 376. 

Smoothing, 198. 

Solipsism, 44, 379, 393. 

Southwell, R. V., 174. 

Spearman, C., 204. 

Standard deviation, 92, 128, 133. 
Standard error, 62. 

errors, agreement of, 242. 

Stars, colour and spectral type, 210. 
Statistical mechanics, 28, 369.

Statistics, sufficient, 92. 
ancillary, 182. 
efficiency of, 179. 
unbiased, 177. 

Stebbing, L. S., 14. 

Stevens, W. L., 333. 

Stieltjes integral, 73. 

Stirling’s formula, 61. 

Storer, W. O., 127. 

Struggle for existence, 6. 

‘Student’, 94, 122, 206, 219, 271, 350, 
364. 

Succession, rule of, 110. 


Suggested values, 108. 

Survey, Ordnance, 176. 

t rule, 96, 122, 124, 128. 

significance, 242, 316, 319, 402, 403. 
Taylor, Sir G. I., 332, 378, 382, 390.
Telepathy, 333. 

Teodorescu, 268. 

Theory, 390. 

Thorburn, W. M., 316.

Tidal friction, 390. 

Tires, strength of, 258. 

Titchmarsh, E. C., 76. 

Triangular distribution, 86. 

True value, 62. 

significance tests, 242.

Turbulence, 332. 

Turner, H. H., 227. 

Twins, 238, 312. 

Uncertainty principle, 13.

Undistributed middle, 2, 39, 381. 
Unforeseen alternative, 39, 381. 

Uniform reduction, 192. 

Uniformity of Nature, 6, 11. 

Universal assent, 14. 

Unobservables, 383, 387. 

Venn limit, 11, 341, 345. 

Venus, node of, 362. 

Walker, Sir G. T., 229. 

Watson, G. N., 83. 

Weber and Fechner, 32. 

Weight, 124, 136. 

Weldon, W. F. B., 231, 314, 340. 

Whipple, F. J. W., 109, 328. 

Whitehead, A. N., see Principia Mathematica.

Whittaker, Sir E. T., and Robinson, G.,
80, 84, 202. 

Wish-fulfilment, 16, 392. 

Wrinch, D., 26, 53, 100, 112.

Yamaguti, S., 327.

Yates, F., 209, 214, 219, 281, 357.

Yule, G. Udny, 49, 69, 88, 96, 208, 239, 
326, 366. 

z rule, 95, 126. 
modification of, 97. 
significance, 266, 267, 404. 

8, definition, 48. 

χ², 85, 87, 363.

χ² with estimated standard errors, 97.

too small, 281.

χ′², 170.



PRINTED IN
GREAT BRITAIN
AT THE
UNIVERSITY PRESS
OXFORD
BY
CHARLES BATEY
PRINTER
TO THE
UNIVERSITY




